"Fault attacks have recently become a serious concern in the smart card industry.
"Fault Tolerant Systems" provides the reader with a clear exposition of these
attacks and the protection strategies that can be used to thwart them. A must read
for practitioners and researchers working in the field."

David Naccache, École normale supérieure

"Understanding the fundamentals of an area, whether it is golf or fault tolerance, 
is a prerequisite to developing expertise in the area. Krishna and Koren's book can 
provide a reader with this underlying foundation for fault tolerance. This book 
is particularly timely because the design of fault-tolerant computing components, 
such as processors and disks, is becoming increasingly important to the
mainstream computing industry."

Shubu Mukherjee, Director, FACT-AMI Group, Intel Corporation

"Professors Koren and Krishna have written a modern, dual-purpose text that
first presents the basic fault tolerance tools, describing various redundancy types
both at the hardware and software levels, followed by current research topics. It
reviews fundamental reliability modeling approaches, combinatorial blocks and
Markov chain techniques. Notably, there is a complete chapter on statistical
simulation methods that offers guidance to practical evaluations as well as one on
fault-tolerant networks. All chapters, which are clearly written including
illuminating examples, have extensive reference lists whereby students can delve deeper
into almost any topic. Several practical and commercial computing systems that
incorporate fault tolerance are detailed. Furthermore, there are two chapters
introducing current fault tolerance research challenges, cryptographic systems and
defects in VLSI designs."

Robert Redinbo, UC Davis 

"The field of Fault-Tolerant Computing has advanced considerably in the past ten 
years and yet no effort has been made to put together these advances in the form 
of a book or a comprehensive paper for the students starting in this area. This is 
the first book I know of in the past 10 years that deals with hardware and software 
aspects of fault tolerant computing, is very comprehensive, and is written as a text 
for the course." 

Kewal Saluja, University of Wisconsin, Madison 
Foreword


Systems used in critical applications such as health, commerce, transportation,
utilities, and national security must be highly reliable. Ubiquitous use of
computing systems and other electronic systems in these critical areas requires that
computing systems have high reliability. High reliability is achieved by designing
the systems to be fault-tolerant. Even though the high reliability requirements of
computing systems gave the original impetus to the study of the design of
fault-tolerant systems, trends in manufacturing of VLSI circuits and systems are also
requiring the use of fault-tolerant design methods to achieve high yields from
manufacturing plants. This is because, with reduced feature sizes of
VLSI circuit designs and shortcomings of lithographic techniques used in
fabrication, the characteristics of the manufactured devices are becoming unpredictable.
Additionally, the small sizes of devices make them susceptible to radiation-induced
failures that cause run-time errors. Thus it may be necessary to use fault tolerance
techniques even in systems that are used in non-critical applications such as
consumer electronics.

This book comprehensively covers the design of fault-tolerant hardware and
software, the use of fault-tolerance techniques to improve manufacturing yields, and
the design and analysis of networks. Additionally, it includes material on methods to
protect against threats to encryption subsystems used for security purposes. The
material in the book will be of immense help to students and practitioners in electrical
and computer engineering and computer science in learning how to design
reliable computing systems and how to analyze fault-tolerant computing systems.

Sudhakar M. Reddy

Distinguished Professor of Electrical and Computer Engineering

University of Iowa Foundation
Iowa City, Iowa




Preface 


The purpose of this book is to provide a solid introduction to the rich field of
fault-tolerant computing. Its intended use is as a text for senior-level undergraduate
and first-year graduate students, as well as a reference for practicing engineers in the
industry. Since it would be impossible to cover in one book all the fault-tolerance 
techniques and practices that have been developed or are currently in use, we 
have focused on providing the basics of the field and enough background to allow 
the reader to access more easily the rapidly expanding fault-tolerance literature. 
Readers who are interested in further details should consult the list of references 
at the end of each chapter. To understand this book well, the reader should have 
a basic knowledge of hardware design and organization, principles of software 
development, and probability theory. 

The book has 10 chapters; each chapter has a list of relevant references and a 
set of exercises. Solutions to the exercises are available online, and access to them
is provided by the publisher upon request to instructors who adopt this book as a
textbook for their class. PowerPoint slides for instructors are also available.

The book starts with an outline of preliminaries, in which we provide
introductory information. This is followed by a set of six chapters that form the core of
what we believe should be covered in any introduction to fault-tolerant systems. 

Chapter 2 deals with hardware fault-tolerance; this is the discipline with the 
longest history (indeed, the idea of using hardware redundancy for fault-tolerance 
goes back to the very pioneers of computing, most notably von Neumann). We also 
include in this chapter an introduction to some of the probabilistic tools used in 
analyzing reliability measures. 

Chapter 3 deals with information redundancy with the main focus on error 
detecting and correcting codes. Such codes, like hardware fault-tolerance, go back 
a very long way, and were motivated in large measure by the need to counter 
errors in information transmission. The same, or similar, techniques are being used 
today in other applications as well, principally in contemporary memory circuits. 
We have sought to provide a survey of only the more important coding techniques;
it is not intended to be comprehensive: indeed, a comprehensive survey
of coding would require multiple volumes. Following this, we turn to the issue 
of managing information redundancy in storage, and end with algorithm-based 
fault-tolerance. 

Chapter 4 covers fault-tolerant networks. With processors becoming cheaper, 
distributed systems are becoming more commonplace; we look at some key
network topologies and consider how to quantify and enhance their fault-tolerance.

Chapter 5 describes techniques for software fault-tolerance. It is widely
believed that software accounts for a majority of the failures seen in today's
computer systems. As a field, software fault-tolerance is less mature than
fault-tolerance using either hardware or information redundancy. It is also a much
harder nut to crack. Software is probably the most complex artificial construct 
that people have created, and rendering it fault-tolerant is an arduous task. We 
cover such techniques as recovery blocks and N-version programming, together 
with a discussion of acceptance tests and ways to model software failure processes 
analytically. 

In Chapter 6, we cover the use of time-redundancy through checkpointing. The
majority of hardware failures are transient in nature; in other words, they are
failures which simply go away after some time. An obvious response to such failures
is to roll back the execution and re-execute the program. Checkpointing is a
technique which allows us to limit the extent of such re-executions.

Chapter 7, which contains several case studies, rounds off the core of the book. 
There, we describe several real-life examples of fault-tolerant systems illustrating 
the usage of the various techniques presented in the previous chapters. 

The remaining three chapters of the book deal with more specialized topics. In 
Chapter 8, we cover defect-tolerance in VLSI. As die sizes increase and feature 
sizes drop, it is becoming increasingly important to be able to tolerate
manufacturing defects in a VLSI chip without affecting its functionality. We discuss the key
approaches being used, as well as the underlying mathematical models. 

In Chapter 9, we focus on cryptographic devices. The increasing use of
computers in commerce, including smart cards and Internet shopping, has motivated
the use of encryption in everyday applications. It turns out that injecting faults 
into cryptographic devices and observing the outputs is an effective way to attack 
secure systems and obtain their secret key. We present in this chapter the use of 
fault-detection to counter these types of security attacks. 

Chapter 10, which ends the book, deals with simulation and experimental
techniques. Simulating a fault-tolerant system to measure its reliability is often
computationally very demanding. We provide in this chapter an outline of basic
simulation techniques, as well as ways in which simulation can be accelerated. Also
provided are basic statistical tools by which simulation output can be analyzed 
and an outline of experimental fault-injection techniques. 

A companion web site (www.ecs.umass.edu/ece/koren/FaultTolerantSystems/)
includes additional resources for the book such as lecture slides, the
inevitable list of errors, and, more importantly, a link to an extensive collection of
educational tools and simulators that can be of great assistance to the readers of
the book. Elsevier also maintains an instructor web site that will house the
solutions for those who adopt this book as a textbook for their class. The website can
be found at http://textbooks.elsevier.com.
Preliminaries


The past 50 years have seen computers move from being expensive
computational engines used by government and big corporations to becoming an
everyday commodity, deeply embedded in practically every aspect of our lives. Not
only are computers visible everywhere, in desktops, laptops, and PDAs, it is also
a commonplace that they are invisible everywhere, as vital components of cars,
home appliances, medical equipment, aircraft, industrial plants, and power
generation and distribution systems. Computer systems underpin most of the world's
financial systems: given current transaction volumes, trading in the stock, bond,
and currency markets would be unthinkable without them. Our increasing
willingness, as a society, to place computers in life-critical and wealth-critical
applications is largely driven by the increasing possibilities that computers offer. And yet,
as we depend more and more on computers to carry out all of these vital actions, 
we are—implicitly or explicitly—gambling our lives and property on computers 
doing their jobs properly. 

Computers (hardware plus software) are quite likely the most complex systems 
ever created by human beings. The complexity of computer hardware is still
increasing as designers attempt to exploit the higher transistor density that new
generations of technology make available to them. Computer software is far more
complex still, and with that complexity comes an increased propensity to failure. It
is probably fair to say that there is not a single large piece of software or hardware
today that is free of bugs. Even the space shuttle, with software that was
developed and tested using some of the best and most advanced techniques known to
engineering, is now known to have flown with bugs that had the potential to cause 
catastrophe. 

Computer scientists and engineers have responded to the challenge of
designing complex systems with a variety of tools and techniques to reduce the number
of faults in the systems they build. However, that is not enough: we need to build
systems that will acknowledge the existence of faults as a fact of life, and
incorporate techniques to tolerate these faults while still delivering an acceptable level of
service. The resulting field of fault tolerance is the subject of this book.


1.1 Fault Classification 

In everyday language, the terms fault, failure, and error are used interchangeably. 
In fault-tolerant computing parlance, however, they have distinctive meanings. 
A fault (or failure) can be either a hardware defect or a software/programming 
mistake (bug). In contrast, an error is a manifestation of the fault/failure/bug. 

As an example, consider an adder circuit, with an output line stuck at 1; it
always carries the value 1 independently of the values of the input operands. This
is a fault, but not (yet) an error. This fault causes an error when the adder is used
and the result on that line is supposed to have been a 0, rather than a 1. A similar
distinction exists between programming mistakes and execution errors. Consider,
for example, a subroutine that is supposed to compute sin(x) but owing to a
programming mistake calculates the absolute value of sin(x) instead. This mistake
will result in an execution error only if that particular subroutine is used and the 
correct result is negative. 
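
The sin(x) example can be made concrete with a short sketch. The function names below (faulty_sin, is_error) are hypothetical and chosen only for illustration; the point is that the fault is always present in the code, while an error appears only for inputs whose correct result is negative.

```python
import math

def faulty_sin(x: float) -> float:
    """Buggy subroutine: returns |sin(x)| instead of sin(x). This is the fault."""
    return abs(math.sin(x))

def is_error(x: float, tol: float = 1e-12) -> bool:
    """True when the fault is activated, i.e., the output differs from sin(x)."""
    return abs(faulty_sin(x) - math.sin(x)) > tol

# The fault exists for every call, but it becomes an error only
# when the correct value of sin(x) is negative.
print(is_error(math.pi / 2))      # sin(x) = 1: fault stays dormant
print(is_error(3 * math.pi / 2))  # sin(x) = -1: fault produces an error
```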

Both faults and errors can spread through the system. For example, if a chip 
shorts out power to ground, it may cause nearby chips to fail as well. Errors can 
spread because the output of one unit is used as input by other units. To return 
to our previous examples, the erroneous results of either the faulty adder or the 
sin(x) subroutine can be fed into further calculations, thus propagating the error. 

To limit such contagion, designers incorporate containment zones into systems. 
These are barriers that reduce the chance that a fault or error in one zone will 
propagate to another. For example, a fault-containment zone can be created by
ensuring that the maximum possible voltage swings in one zone are insulated from
the other zones, and by providing an independent power supply to each zone. In 
other words, the designer tries to electrically isolate one zone from another. An 
error-containment zone can be created, as we will see in some detail later on, by 
using redundant units/programs and voting on their output. 

Hardware faults can be classified according to several aspects. Regarding their
duration, hardware faults can be classified into permanent, transient, or intermittent.
A permanent fault is just that: it reflects the permanent going out of commission of
a component. As an example of a permanent fault think of a burned-out lightbulb.
A transient fault is one that causes a component to malfunction for some time; it
goes away after that time and the functionality of the component is fully restored.
As an example, think of a random noise interference during a telephone
conversation. Another example is a memory cell with contents that are changed spuriously
due to some electromagnetic interference. The cell itself is undamaged: it is just
that its contents are wrong for the time being, and overwriting the memory cell
will make the fault go away. An intermittent fault never quite goes away entirely;
it oscillates between being quiescent and active. When the fault is quiescent, the
component functions normally; when the fault is active, the component
malfunctions. An example of an intermittent fault is a loose electrical connection.

Another classification of hardware faults is into benign and malicious faults.
A fault that just causes a unit to go dead is called benign. Such faults are the
easiest to deal with. Far more insidious are the faults that cause a unit to produce
reasonable-looking, but incorrect, output, or that make a component "act
maliciously" and send differently valued outputs to different receivers. Think of an
altitude sensor in an airplane that reports a 1000-foot altitude to one unit and an
8000-foot altitude to another unit. These are called malicious (or Byzantine) faults.


1.2 Types of Redundancy

All of fault tolerance is an exercise in exploiting and managing redundancy.
Redundancy is the property of having more of a resource than is minimally necessary to
do the job at hand. As failures happen, redundancy is exploited to mask or
otherwise work around these failures, thus maintaining the desired level of
functionality.

There are four forms of redundancy that we will study: hardware, software, 
information, and time. Hardware faults are usually dealt with by using hardware, 
information, or time redundancy, whereas software faults are protected against by 
software redundancy. 

Hardware redundancy is provided by incorporating extra hardware into the
design to either detect or override the effects of a failed component. For example,
instead of having a single processor, we can use two or three processors, each
performing the same function. By having two processors, we can detect the failure
of a single processor; by having three, we can use the majority output to override
the wrong output of a single faulty processor. This is an example of static hardware
redundancy, the main objective of which is the immediate masking of a failure. A
different form of hardware redundancy is dynamic redundancy, where spare
components are activated upon the failure of a currently active component. A
combination of static and dynamic redundancy techniques is also possible, leading to
hybrid hardware redundancy.
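
The two-processor and three-processor cases can be sketched in software as follows. This is only an illustration under simplifying assumptions: processor outputs are modeled as plain integers, the comparator and voter are assumed to be fault-free, and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence

def duplex_mismatch(outputs: Sequence[int]) -> bool:
    """Two replicas: a mismatch detects a failure, but cannot tell which unit is wrong."""
    first, second = outputs
    return first != second

def majority_vote(outputs: Sequence[int]) -> int:
    """Three or more replicas: return the strict-majority value, masking a minority."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise RuntimeError("no majority: too many replicas disagree")
    return value

print(duplex_mismatch([42, 7]))   # True: failure detected, not corrected
print(majority_vote([42, 42, 7])) # 42: the single faulty output is overridden
```

With three replicas the voter masks one wrong output immediately, which is the static redundancy described above; it cannot help if two replicas fail in the same way.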

Hardware redundancy can thus range from a simple duplication to complicated 
structures that switch in spare units when active ones become faulty. These forms 
of hardware redundancy incur high overheads, and their use is therefore normally 
reserved for critical systems where such overheads can be justified. In
particular, substantial amounts of redundancy are required to protect against malicious
faults. 

The best-known form of information redundancy is error detection and
correction coding. Here, extra bits (called check bits) are added to the original data bits
so that an error in the data bits can be detected or even corrected. The resulting
error-detecting and error-correcting codes are widely used today in memory units
and various storage devices to protect against benign failures. Note that these
error codes (like any other form of information redundancy) require extra hardware
to process the redundant data (the check bits).
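
As a minimal illustration of check bits, the sketch below uses a single even-parity bit, which is an error-detecting (not correcting) code chosen here only because it is the simplest case. The helper names are hypothetical.

```python
def add_parity(data_bits: list[int]) -> list[int]:
    """Append one check bit so that the codeword has an even number of 1s."""
    return data_bits + [sum(data_bits) % 2]

def parity_ok(codeword: list[int]) -> bool:
    """A codeword is accepted only if its overall parity is still even."""
    return sum(codeword) % 2 == 0

def flip(codeword: list[int], position: int) -> list[int]:
    """Return a copy of the codeword with one bit inverted (a simulated error)."""
    corrupted = list(codeword)
    corrupted[position] ^= 1
    return corrupted

word = add_parity([1, 0, 1, 1])           # three 1s, so the check bit is 1
print(parity_ok(word))                    # True
print(parity_ok(flip(word, 2)))           # False: single-bit error detected
print(parity_ok(flip(flip(word, 0), 1)))  # True: a double error goes unnoticed
```

The last line shows the limit of a single check bit: detecting or correcting more errors requires more check bits.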

Error-detecting and error-correcting codes are also used to protect data
communicated over noisy channels, which are channels that are subject to many transient
failures. These channels can be either the communication links among widely
separated processors (e.g., the Internet) or among locally connected processors that
form a local network. If the code used for data communication is capable of only
detecting the faults that have occurred (but not correcting them), we can
retransmit as necessary, thus employing time redundancy.

In addition to transient data communication failures due to noise, local and 
wide-area networks may experience permanent link failures. These failures may 
disconnect one or more existing communication paths, resulting in a longer
communication delay between certain nodes in the network, a lower data bandwidth
between certain node pairs, or even a complete disconnection of certain nodes
from the rest of the network. Redundant communication links (i.e., hardware
redundancy) can alleviate most of these problems.

Computing nodes can also exploit time redundancy through re-execution of
the same program on the same hardware. As before, time redundancy is
effective mainly against transient faults. Because the majority of hardware faults are
transient, it is unlikely that the separate executions will experience the same fault.
Time redundancy can thus be used to detect transient faults in situations in which
such faults may otherwise go undetected. Time redundancy can also be used when
other means for detecting errors are in place and the system is capable of
recovering from the effects of the fault and repeating the computation. Compared with
the other forms of redundancy, time redundancy has much lower hardware and
software overhead but incurs a high performance penalty.
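
Detection by re-execution can be sketched as follows. The FlakySquarer class is a hypothetical stand-in for hardware that suffers one transient fault; it is an assumption made for the example, not a model taken from the text.

```python
from typing import Callable

def execute_twice(compute: Callable[[int], int], x: int) -> tuple[int, bool]:
    """Run compute(x) twice on the same unit; report whether the two results agree."""
    first = compute(x)
    second = compute(x)
    return second, first == second

class FlakySquarer:
    """Hypothetical unit whose first call is hit by a one-off transient bit flip."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, x: int) -> int:
        self.calls += 1
        result = x * x
        if self.calls == 1:
            result ^= 1  # transient fault corrupts the lowest bit once
        return result

print(execute_twice(FlakySquarer(), 12))     # (144, False): disagreement reveals the fault
print(execute_twice(lambda x: x * x, 12))    # (144, True): fault-free unit
```

The cost is visible in the code: every computation runs twice, which is the performance penalty mentioned above. A permanent fault would corrupt both runs identically and go unnoticed.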

Software redundancy is used mainly against software failures. It is a reasonable 
guess that every large piece of software that has ever been produced has contained 
faults (bugs). Dealing with such faults can be expensive: one way is to
independently produce two or more versions of that software (preferably by disjoint teams
of programmers) in the hope that the different versions will not fail on the same
input. The secondary version(s) can be based on simpler and less accurate
algorithms (and, consequently, less likely to have faults) to be used only upon the
failure of the primary software to produce acceptable results. Just as for
hardware redundancy, the multiple versions of the program can be executed either
concurrently (requiring redundant hardware as well) or sequentially (requiring 
extra time, i.e., time redundancy) upon a failure detection. 
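
The sequential use of a primary and a secondary version can be sketched as below. Everything here is illustrative: the square-root routines, the deliberately planted bug, and the acceptance check are assumptions made for the example rather than a scheme prescribed by the text.

```python
import math
from typing import Callable, Sequence

def run_versions(versions: Sequence[Callable[[float], float]],
                 acceptable: Callable[[float, float], bool],
                 x: float) -> float:
    """Try each version in order; return the first result that passes the check."""
    for version in versions:
        try:
            result = version(x)
        except Exception:
            continue  # a crash counts as a failed version
        if acceptable(x, result):
            return result
    raise RuntimeError("all versions failed to produce an acceptable result")

def primary_sqrt(x: float) -> float:
    """Primary version with a planted (hypothetical) bug for large inputs."""
    return x / 2 if x > 100 else math.sqrt(x)

def secondary_sqrt(x: float) -> float:
    """Simpler secondary version: a fixed number of Newton iterations."""
    guess = x if x > 1.0 else 1.0
    for _ in range(60):
        guess = 0.5 * (guess + x / guess)
    return guess

def acceptable(x: float, result: float) -> bool:
    """Check the result by squaring it and comparing against the input."""
    return result >= 0 and abs(result * result - x) <= 1e-6 * max(1.0, x)

print(run_versions([primary_sqrt, secondary_sqrt], acceptable, 4.0))
print(run_versions([primary_sqrt, secondary_sqrt], acceptable, 400.0))
```

For the second call the primary version returns an unacceptable value, so the secondary version is executed afterwards, costing extra time.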


1.3 Basic Measures of Fault Tolerance 

Because fault tolerance is about making machines more dependable, it is
important to have proper measures (yardsticks) by which to gauge such dependability.
In this section, we will examine some of these yardsticks and their application.

A measure is a mathematical abstraction that expresses some relevant facet of
the performance of its object. By its very nature, a measure only captures some 
subset of the properties of an object. The trick in defining a suitable measure is to 
keep this subset large enough so that behaviors of interest to the user are captured, 
and yet not so large that the measure loses focus. 

1.3.1 Traditional Measures 

We first describe the traditional measures of dependability of a single computer. 
These metrics have been around for a long time and measure very basic attributes 
of the system. Two of these measures are reliability and availability. 

The conventional definition of reliability, denoted by R(t), is the probability (as
a function of the time t) that the system has been up continuously in the time
interval [0, t]. This measure is suitable for applications in which even a
momentary disruption can prove costly. One example is computers that control physical
processes such as an aircraft, for which failure would result in catastrophe.

Closely related to reliability are the Mean Time to Failure, denoted by MTTF,
and the Mean Time Between Failures, MTBF. The first is the average time the system
operates until a failure occurs, whereas the second is the average time between two 
consecutive failures. The difference between the two is due to the time needed to 
repair the system following the first failure. Denoting the Mean Time to Repair by 
MTTR, we obtain 

MTBF = MTTF + MTTR 

Availability, denoted by A(t), is the average fraction of time over the interval 
[0, t] that the system is up. This measure is appropriate for applications in which 
continuous performance is not vital but where it would be expensive to have the 
system down for a significant amount of time. An airline reservation system needs 
to be highly available, because downtime can put off customers and lose sales; 
however, an occasional (very) short-duration failure can be well tolerated. 

The long-term availability, denoted by A, is defined as 

A = \lim_{t \to \infty} A(t)

A can be interpreted as the probability that the system will be up at some
random point in time, and is meaningful only in systems that include repair of faulty
components. The long-term availability can be calculated from MTTF, MTBF, and 
MTTR as follows: 

A = \frac{MTTF}{MTBF} = \frac{MTTF}{MTTF + MTTR}

A related measure, Point Availability, denoted by A_p(t), is the probability that
the system is up at the particular time instant t.

It is possible for a low-reliability system to have high availability: consider a 
system that fails every hour on the average but comes back up after only a second. 
Such a system has an MTBF of just 1 hour and, consequently, a low reliability;
however, its availability is high: A = 3599/3600 = 0.99972.
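
The arithmetic of this example follows directly from MTBF = MTTF + MTTR. A small sketch, with all times in seconds and a hypothetical helper name:

```python
def long_term_availability(mttf: float, mttr: float) -> float:
    """A = MTTF / MTBF, with MTBF = MTTF + MTTR."""
    return mttf / (mttf + mttr)

mtbf = 3600.0          # one failure per hour on average
mttr = 1.0             # the system comes back up after one second
mttf = mtbf - mttr     # 3599 seconds of operation between repair and next failure

print(round(long_term_availability(mttf, mttr), 5))  # 0.99972
```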

These definitions assume, of course, that we have a state in which the system
can be said to be "up" and another in which it is not. For simple components, this is
a good assumption. For example, a lightbulb is either good or burned out. A wire
is either connected or has a break in it. However, for even simple systems, such an
assumption can be very limiting. For example, consider a processor that has one
of its several hundreds of millions of gates stuck at logic value 0. In other words,
the output of this logic gate is always 0, regardless of the input. Suppose the rest
of the processor is functional, and that this failed logic gate only affects the output
of the processor about once in every 25,000 hours of use. For example, a particular
gate in the divide unit, when faulty, may result in a wrong quotient if the
divisor is within a certain subset of values. Clearly, the processor is not fault-free,
but would one define it as "down"?

The same remarks apply with even greater force to systems that degrade
gracefully. By this, we mean systems with various levels of functionality. Initially, with
all of its components operational, the system is at its highest level of functionality. 
As these components fail, the system degrades from one level of functionality to 
the next. Beyond a certain point, the system is unable to produce anything of use 
and fails completely. As with the previous example, the system has multiple "up" 
states. Is it said to fail when it degrades from full to partial functionality? Or when 
it fails to produce any useful output at all? Or when its functionality falls below a 
certain threshold? If the last, what is this threshold, and how is it determined? 

We can therefore see that traditional reliability and availability are very limited
in what they can express. There are obvious extensions to these measures. For
example, we may consider the average computational capacity of a system with n
processors. Let c_i denote the computational capacity of a system with i operational
processors. This can be a simple linear function of the number of processors,
c_i = i c_1, or a more complex function of i, depending on the ability of the application to
utilize i processors. The Average Computational Capacity of the system can then be
defined as

\sum_{i=1}^{n} c_i P_i(t)

where P_i(t) is the probability that exactly i processors are
operational at time t. In contrast, the reliability of the system at time t will be

R(t) = \sum_{i=m}^{n} P_i(t)

where m is the minimum number of processors necessary for proper operation of
the system.
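
Both sums can be evaluated once P_i(t) is known. The sketch below assumes, purely for illustration, n identical and independent processors without repair, each operational at time t with probability r = exp(-failure_rate * t), so that P_i(t) is binomial. The failure rate and function names are hypothetical.

```python
import math
from typing import Callable

def p_operational(i: int, n: int, r: float) -> float:
    """P_i(t) under the ASSUMED binomial model: exactly i of n processors are up."""
    return math.comb(n, i) * r**i * (1.0 - r) ** (n - i)

def average_capacity(n: int, r: float, capacity: Callable[[int], float]) -> float:
    """Sum of c_i * P_i(t) for i = 1..n."""
    return sum(capacity(i) * p_operational(i, n, r) for i in range(1, n + 1))

def reliability(n: int, m: int, r: float) -> float:
    """Sum of P_i(t) for i = m..n (at least m processors operational)."""
    return sum(p_operational(i, n, r) for i in range(m, n + 1))

failure_rate = 1e-4                      # assumed failures per hour per processor
t = 1000.0                               # hours
r = math.exp(-failure_rate * t)          # per-processor survival probability at t

print(average_capacity(4, r, lambda i: float(i)))  # linear capacity, c_1 = 1
print(reliability(4, 3, r))                        # needs at least 3 of 4 processors
```

With the linear capacity c_i = i c_1, the average capacity under this model reduces to n r c_1, which offers a quick sanity check.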

1.3.2 Network Measures 

In addition to the general system measures previously discussed, there are also
more specialized measures, focusing on the network that connects the processors
together. The simplest of these are classical node and line connectivity, which are
defined as the minimum number of nodes and lines, respectively, that have to fail
before the network becomes disconnected. This gives a rough indication of how
vulnerable a network is to disconnection: for example, a network that can be
disconnected by the failure of just one (critically positioned) node is potentially more
vulnerable than another that requires at least four nodes to fail before it becomes
disconnected.

FIGURE 1.1 Inadequacy of classical connectivity (two example networks, N1 and N2).

Classical connectivity is a very basic measure of network reliability. Like
reliability, it distinguishes between only two network states: connected and
disconnected. It says nothing about how the network degrades as nodes fail before, or
after, becoming disconnected. Consider the two networks shown in Figure 1.1.
Both networks have the same classical node connectivity of 1. However, in a real
sense, network N1 is much more "connected" than N2. The probability that N2
splinters into small pieces is greater than that for N1.

To express this type of "connectivity robustness," we can use additional
measures. Two such measures are the average node-pair distance, and the network
diameter (the maximum node-pair distance), both calculated given the probability
of node and/or link failure. Such network measures, together with the traditional
measures listed above, allow us to gauge the dependability of various networked
systems that consist of computing nodes connected through a network of
communication links.
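
For one fixed pattern of failures, both measures can be computed by breadth-first search, as in the sketch below. The five-node ring is a made-up example (not the networks of Figure 1.1), and the helper names are hypothetical; the probabilistic versions of the measures would average such results over failure patterns weighted by their probabilities.

```python
from collections import deque

Graph = dict[int, set[int]]

def distances_from(graph: Graph, src: int) -> dict[int, int]:
    """Hop counts from src to every reachable node (breadth-first search)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def pair_metrics(graph: Graph):
    """Return (average node-pair distance, diameter), or None if disconnected."""
    n = len(graph)
    total, diameter = 0, 0
    for u in graph:
        dist = distances_from(graph, u)
        if len(dist) < n:
            return None
        total += sum(dist.values())
        diameter = max(diameter, max(dist.values()))
    return total / (n * (n - 1)), diameter

def without_nodes(graph: Graph, failed: set[int]) -> Graph:
    """Copy of the graph with the failed nodes and their links removed."""
    return {u: nbrs - failed for u, nbrs in graph.items() if u not in failed}

ring = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
print(pair_metrics(ring))                      # fault-free ring
print(pair_metrics(without_nodes(ring, {0})))  # ring degraded to a path
```

In this example a single node failure leaves the network connected but lengthens the paths, which is exactly the kind of degradation that classical connectivity does not capture.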

1.4 Outline of This Book 

The next chapter is devoted to hardware fault tolerance. This is the most
established topic within fault-tolerant computing, and many of the basic principles and
techniques that have been developed for it have been extended to other forms
of fault tolerance. Techniques to evaluate the reliability and availability of
fault-tolerant systems are introduced here, including the use of Markov models.

Next, several variations of information redundancy are covered, starting with
the most widely used error detecting and correcting codes. Then, other forms of
information redundancy are discussed, including storage redundancy (RAID
systems), data replication in distributed systems, and, finally, the algorithm-based
fault-tolerance technique that tolerates data errors in array computations using
some error-detecting and error-correcting codes.

Many computing systems nowadays consist of multiple networked
processors that are subject to interconnection link failures, in addition to the
already-discussed single node/processor failures. We, therefore, present in this book
suitable fault tolerance techniques for these networks and analysis methods to
determine which network topologies are more robust.

Software mistakes/bugs are, in practice, unavoidable, and consequently, some 
level of software fault tolerance is a must. This can be as simple as acceptance tests 
to check the reasonableness of the results before using them, or as complex as
running two or more versions of the software (sequentially or in parallel). Programs
also tend to have their state deteriorate after running for long periods of time and
eventually crash. This situation can be avoided by periodically restarting the
program, a process called rejuvenation. Unlike hardware faults, software faults are
very hard to model. Still, a few such models have been developed and several of 
them are described. 

Hardware fault-tolerance techniques can be quite costly to implement. In
applications in which a complete and immediate masking of the effect of hardware
faults (especially, of transient nature) is not necessary, checkpointing is an
inexpensive alternative. For programs that run for a long time and for which re-execution
upon a failure might be too costly, the program state can be saved (once or
periodically) during the execution. Upon a failure occurrence, the system can roll back the
program to the most recent checkpoint and resume its execution from that point.
Various checkpointing techniques are presented and analyzed in the book. 

Case studies illustrating the use of many of the fault-tolerance techniques de¬ 
scribed previously are presented, including Tandem, Stratus, Cassini, and micro¬ 
processors from IBM and Intel. 

Two fault-tolerance topics that are rapidly increasing in practical importance, 
namely, defect tolerant VLSI design and fault tolerance in cryptographic devices 
are discussed. The increasing complexity of VLSI chip design has resulted in a 
situation in which manufacturing defects are unavoidable. If nothing is done to 
remedy this situation, the expected yield (the fraction of manufactured chips which 
are operational) will be very low. Thus, techniques to reduce the sensitivity of 
VLSI chips to defects have been developed, some of which are very similar to the 
hardware redundancy schemes. 

For cryptographic devices, the need for fault tolerance is two-fold. Not only 
is it crucial that such devices (e.g., smart cards) operate in a fault-free manner in 
whatever environment they are used, but more importantly, they must stay secure. 
Fault-injection-based attacks on cryptographic devices have become the simplest 
and fastest way to extract the secret key from the device. Thus, the incorporation 
of fault tolerance is a must in order to keep cryptographic devices secure.

An important part of the design and evaluation process of a fault-tolerant system
is to demonstrate that the system does indeed function at the advertised
level of reliability. Often the designed system is too complex to develop analytical
expressions of its reliability. If a prototype of the system has already been
constructed, then fault-injection experiments can be performed and certain de¬ 
pendability attributes measured. If, however, as is very common, a prototype does 
not yet exist, statistical simulation must be used. Simulation programs for com¬ 
plex systems must be carefully designed to produce accurate results. We discuss 
the principles that should be followed when preparing a simulation program, and 
show how simulation results can be analyzed to infer system reliability. 


1.5 Further Reading 

Several textbooks and reference books on the topic of fault tolerance have been 
published in the past. See, for example, [2,4,5,9,10,13-16]. Journals have published 
several special issues on fault-tolerant computing, e.g., [7,8]. The major conference 
in the field is the Conference on Dependable Systems and Networks (DSN) [3]; this is a
successor to the Fault-Tolerant Computing Symposium (FTCS).

The concept of computing being invisible everywhere appeared in [19], in the 
context of pervasive computing, that is, computing that pervades everyday living, 
without being obtrusive. 

The definitions of the basic terms and measures appear in most of the text¬ 
books mentioned above and in several probability and statistics books. For exam¬ 
ple, see [18]. Our definitions of fault and error are slightly different from those 
used in some of the references. A generally used definition of an error is that it is 
that part of the system state that leads to system failure. Strictly interpreted, this 
only applies to a system with state, i.e., with memory. We use the more encompass¬ 
ing definition of anything that can be construed as a manifestation of a fault. This 
wider interpretation allows purely combinational circuits, which are stateless, to 
generate errors. 

One measure of dependability that we did not describe in the text is to con¬ 
sider everything from the perspective of the application. This approach was taken 
to define the measure known as performability. The application is used to define
"accomplishment levels" $L_1, L_2, \ldots, L_n$. Each of these represents a level of quality
of service delivered by the application. For example, $L_i$ may be defined as follows:
"There are $i$ system crashes during the time period $[0, T]$." Now, the performance
of the computer affects this quality (if it did not, by definition, it would have nothing
to do with the application!). The approach taken by performability is to link
the performance of the computer to the accomplishment level that this enables.
Performability is then a vector, $(P(L_1), P(L_2), \ldots, P(L_n))$, where $P(L_i)$ is the
probability that the computer functions well enough to permit the application to reach up
to accomplishment level $L_i$. For more on performability, see [6,11,12].


Hardware Fault Tolerance


Hardware fault tolerance is the most mature area in the general field of fault- 
tolerant computing. Many hardware fault-tolerance techniques have been devel¬ 
oped and used in practice in critical applications ranging from telephone ex¬ 
changes to space missions. In the past, the main obstacle to a wide use of hardware 
fault tolerance has been the cost of the extra hardware required. With the contin¬ 
ued reduction in the cost of hardware, this is no longer a significant drawback, and 
the use of hardware fault-tolerance techniques is expected to increase. However, 
other constraints, notably on power consumption, may continue to restrict the use 
of massive redundancy in many applications. 

This chapter first discusses the rate at which hardware failures occur, as well 
as its effect on the reliability of a single component. We then extend the discussion
to more complex systems consisting of multiple components, describe various
resilient structures which have been proposed and implemented, and evaluate
their reliability and/or availability. Next, we describe hardware fault-tolerance
techniques that have been developed specifically for general-purpose processors. 
Finally, we discuss malicious faults and investigate the amount of redundancy 
needed for tolerating them. 


2.1 The Rate of Hardware Failures 

The single most important parameter used in the reliability analysis of hardware 
systems is the component failure rate, which is the rate at which an individual 
component suffers faults. This is the expected number of failures per unit time 
that a currently good component will suffer in a given future time interval. The 
failure rate depends on the current age of the component, any voltage or physical
shocks that it suffers, the ambient temperature, and the technology. The dependence
on age is usually captured by what is known as the bathtub curve (see Figure 2.1).
When components are very young, their failure rate is quite high. This
is due to the chance that some components with manufacturing defects slipped
through manufacturing quality control and were released. As time goes on, these
components are weeded out, and the component spends the bulk of its life showing
a fairly constant failure rate. As it becomes very old, aging effects start to take
over, and the failure rate rises again.

FIGURE 2.1 Bathtub curve.

The impact of the other factors can be expressed through the following empiri¬ 
cal failure rate formula: 


$$\lambda = \pi_L \pi_Q (C_1 \pi_T \pi_V + C_2 \pi_E) \tag{2.1}$$

where the notations are as follows:

$\lambda$   Failure rate of component.

$\pi_L$   Learning factor, associated with how mature the technology is.

$\pi_Q$   Quality factor, representing manufacturing process quality control (ranging
      from 0.25 to 20.00).

$\pi_T$   Temperature factor, with values ranging from 0.1 to 1000. It is proportional
      to $e^{-E_a/kT}$, where $E_a$ is the activation energy in electron-volts
      associated with the technology, $k$ is the Boltzmann constant
      ($0.8625 \times 10^{-4}$ eV/K), and $T$ is the temperature in Kelvin.

$\pi_V$   Voltage stress factor for CMOS devices; can range from 1 to 10, depending
      on the supply voltage and the temperature; does not apply to other
      technologies (where it is set to 1).

$\pi_E$   Environment shock factor; ranges from very low (about 0.4), when the
      component is in an air-conditioned office environment, to very high (13.0)
      when it is in a harsh environment.

$C_1, C_2$   Complexity factors; functions of the number of gates on the chip and the
      number of pins in the package.

Further details can be found in MIL-HDBK-217E, which is a handbook pro¬ 
duced by the U.S. Department of Defense. 
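
As a rough illustration, Equation 2.1 can be evaluated directly once the factors are known. The sketch below is only that: the function names are ours, every factor value is a hypothetical placeholder rather than handbook data, and the units of the result depend on how the complexity factors are tabulated. It also shows how the proportionality of the temperature factor to $e^{-E_a/kT}$ translates into a ratio between two operating temperatures.

```python
import math

BOLTZMANN_EV_PER_K = 0.8625e-4  # value quoted in the text, in eV/K


def failure_rate(pi_l, pi_q, c1, pi_t, pi_v, c2, pi_e):
    """Evaluate Equation 2.1: lambda = pi_L * pi_Q * (C1*pi_T*pi_V + C2*pi_E)."""
    return pi_l * pi_q * (c1 * pi_t * pi_v + c2 * pi_e)


def temperature_ratio(e_a, t_hot, t_cold, k=BOLTZMANN_EV_PER_K):
    """Ratio pi_T(t_hot) / pi_T(t_cold), using pi_T proportional to exp(-E_a / (k*T))."""
    return math.exp(-e_a / (k * t_hot)) / math.exp(-e_a / (k * t_cold))


# Hypothetical factor values, chosen only to exercise the formula.
office = failure_rate(pi_l=1.0, pi_q=1.0, c1=0.01, pi_t=1.0, pi_v=1.0,
                      c2=0.005, pi_e=0.4)
harsh = failure_rate(pi_l=1.0, pi_q=1.0, c1=0.01, pi_t=1.0, pi_v=1.0,
                     c2=0.005, pi_e=13.0)
print(office, harsh)

# Assumed activation energy of 0.5 eV; temperatures in Kelvin.
print(temperature_ratio(e_a=0.5, t_hot=358.0, t_cold=298.0))
```

With all other factors held fixed, moving from the office value of the environment factor to the harsh value raises only the $C_2 \pi_E$ term, which is why the two printed rates differ.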

Devices operating in space, which is replete with charged particles and can subject
devices to severe temperature swings, can thus be expected to fail much more
often than their counterparts in air-conditioned offices. The same is true of computers
in automobiles (which suffer high temperatures and vibration) and in industrial
applications.


2.2 Failure Rate, Reliability, and Mean Time to Failure


In this section, we consider a single component of a more complex system, and 
show how reliability and Mean Time to Failure (MTTF) can be derived from the 
basic notion of failure rate. Consider a component that is operational at time $t = 0$
and remains operational until it is hit by a failure. Suppose now that all failures
are permanent and irreparable. Let $T$ denote the lifetime of the component (the
time until it fails), and let $f(t)$ and $F(t)$ denote the probability density function of
$T$ and the cumulative distribution function of $T$, respectively. These functions are
defined for $t \ge 0$ only (because the lifetime cannot be negative) and are related
through

$$f(t) = \frac{dF(t)}{dt}, \qquad F(t) = \int_0^t f(\tau)\,d\tau \tag{2.2}$$

$f(t)$ represents (but is not equal to) the momentary probability of failure at time
$t$. To be exact, for a very small $\Delta t$, $f(t)\,\Delta t \approx \mathrm{Prob}\{t \le T \le t + \Delta t\}$. Being a density
function, $f(t)$ must satisfy

$$f(t) \ge 0 \ \text{for}\ t \ge 0 \quad \text{and} \quad \int_0^\infty f(t)\,dt = 1$$

$F(t)$ is the probability that the component will fail at or before time $t$.

$$F(t) = \mathrm{Prob}\{T \le t\}$$


$R(t)$, the reliability of a component (the probability that it will survive at least until
time $t$), is given by

$$R(t) = \mathrm{Prob}\{T > t\} = 1 - F(t) \tag{2.3}$$

$f(t)$ represents the probability that a new component will fail at time $t$ in the future.
A more meaningful quantity is the probability that a good component of current
age $t$ will fail in the next instant of length $dt$. This is a conditional probability, since
we know that the component survived at least until time $t$. This conditional probability
is represented by the failure rate (also called the hazard rate) of a component
at time $t$, denoted by $\lambda(t)$, which can be calculated as follows:

$$\lambda(t) = \frac{f(t)}{1 - F(t)} \tag{2.4}$$

Since $\frac{dR(t)}{dt} = -f(t)$, we obtain

$$\lambda(t) = -\frac{1}{R(t)} \frac{dR(t)}{dt} \tag{2.5}$$


Certain types of components suffer no aging and have a failure rate that is constant
over time, $\lambda(t) = \lambda$. In this case,

$$\frac{dR(t)}{dt} = -\lambda R(t)$$

and the solution of this differential equation (with $R(0) = 1$) is

$$R(t) = e^{-\lambda t} \tag{2.6}$$

Therefore, a constant failure rate implies that the lifetime $T$ of the component has
an exponential distribution, with a parameter that is equal to the constant failure
rate $\lambda$:

$$f(t) = \lambda e^{-\lambda t}, \qquad F(t) = 1 - e^{-\lambda t}, \qquad R(t) = e^{-\lambda t} \qquad \text{for } t \ge 0$$

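For an irreparable component, the MTTF is equal to its expected lifetime, $E[T]$
(where $E[\,\cdot\,]$ denotes the expectation or mean of a random variable):

$$\mathrm{MTTF} = E[T] = \int_0^\infty t f(t)\,dt \tag{2.7}$$

Substituting $\frac{dR(t)}{dt} = -f(t)$ yields

$$\mathrm{MTTF} = -\int_0^\infty t \frac{dR(t)}{dt}\,dt = -tR(t)\Big|_0^\infty + \int_0^\infty R(t)\,dt = \int_0^\infty R(t)\,dt \tag{2.8}$$

(the term $-tR(t)$ is equal to zero when $t = 0$ and when $t = \infty$, since $R(\infty) = 0$).

For the case of a constant failure rate, for which $R(t) = e^{-\lambda t}$,

$$\mathrm{MTTF} = \int_0^\infty e^{-\lambda t}\,dt = \frac{1}{\lambda}$$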
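
The relations among failure rate, reliability, and MTTF can be checked numerically. The following is a minimal sketch under our own assumptions: the failure rate value is hypothetical, the derivative in Equation 2.5 is approximated by a central difference, and the integral in Equation 2.8 is truncated at a finite horizon and evaluated with the trapezoidal rule.

```python
import math


def reliability_exp(t, lam):
    """R(t) = exp(-lam * t) for a constant failure rate lam (Equation 2.6)."""
    return math.exp(-lam * t)


def hazard_rate(reliability, t, dt=1e-3):
    """Approximate lambda(t) = -(1/R(t)) dR/dt (Equation 2.5) by a central difference."""
    slope = (reliability(t + dt) - reliability(t - dt)) / (2.0 * dt)
    return -slope / reliability(t)


def mttf_numeric(reliability, horizon, steps=200_000):
    """Approximate MTTF = integral of R(t) from 0 to infinity (Equation 2.8).

    The integral is truncated at `horizon` and evaluated with the trapezoidal rule.
    """
    h = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for i in range(1, steps):
        total += reliability(i * h)
    return total * h


lam = 0.002  # hypothetical failure rate, in failures per hour


def r(t):
    """Reliability of the example component."""
    return reliability_exp(t, lam)


# The hazard rate is the same at both ages, as expected for an exponential lifetime.
print(hazard_rate(r, 100.0), hazard_rate(r, 1000.0))
# The numerical MTTF agrees closely with 1 / lam.
print(mttf_numeric(r, horizon=20.0 / lam), 1.0 / lam)
```

The same two helper functions accept any reliability function, so they can also be applied to distributions whose failure rate is not constant.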

Although a constant failure rate is used in most calculations of reliability (mainly
owing to the simplified derivations), there are cases for which this simplifying
assumption is inappropriate, especially during the "infant mortality" and "wear-out"
phases of a component's life (Figure 2.1). In such cases, the Weibull distribution
is often used. This distribution has two parameters, $\lambda$ and $\beta$, and has the
following density function of the lifetime $T$ of a component:

$$f(t) = \lambda \beta t^{\beta - 1} e^{-\lambda t^{\beta}} \tag{2.9}$$

The corresponding failure rate is

$$\lambda(t) = \lambda \beta t^{\beta - 1} \tag{2.10}$$

This failure rate is an increasing function of time for $\beta > 1$, is constant for $\beta = 1$,
and is a decreasing function of time for $\beta < 1$. This makes it very flexible, and
especially appropriate for the wear-out and infant mortality phases. The component
reliability for a Weibull distribution is

$$R(t) = e^{-\lambda t^{\beta}} \tag{2.11}$$

and the MTTF of the component is

$$\mathrm{MTTF} = \frac{\Gamma(\beta^{-1})}{\beta \lambda^{\beta^{-1}}} \tag{2.12}$$

where $\Gamma(x) = \int_0^\infty y^{x-1} e^{-y}\,dy$ is the Gamma function. The Gamma function is a
generalization of the factorial function to real numbers, and satisfies

■ $\Gamma(x) = (x - 1)\Gamma(x - 1)$ for $x > 1$;

■ $\Gamma(1) = 1$;

■ $\Gamma(n) = (n - 1)!$ for an integer $n$, $n = 1, 2, \ldots$

Note that the Weibull distribution includes as a special case ($\beta = 1$) the exponential
distribution with a constant failure rate $\lambda$.
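
To see how the shape parameter changes the behavior, Equations 2.10 through 2.12 can be coded in a few lines. This is a sketch with hypothetical parameter values and function names of our choosing; it relies on the Gamma function from Python's standard `math` module.

```python
import math


def weibull_reliability(t, lam, beta):
    """R(t) = exp(-lam * t**beta) (Equation 2.11)."""
    return math.exp(-lam * t ** beta)


def weibull_failure_rate(t, lam, beta):
    """lambda(t) = lam * beta * t**(beta - 1) (Equation 2.10)."""
    return lam * beta * t ** (beta - 1.0)


def weibull_mttf(lam, beta):
    """MTTF = Gamma(1/beta) / (beta * lam**(1/beta)) (Equation 2.12)."""
    return math.gamma(1.0 / beta) / (beta * lam ** (1.0 / beta))


lam = 0.01  # hypothetical value of the first Weibull parameter
for beta in (0.5, 1.0, 2.0):
    # beta < 1: decreasing rate; beta = 1: constant rate; beta > 1: increasing rate.
    rates = [weibull_failure_rate(t, lam, beta) for t in (1.0, 10.0, 100.0)]
    print(beta, rates, weibull_mttf(lam, beta))
```

For $\beta = 1$ the printed failure rates are all equal to $\lambda$ and the MTTF reduces to $1/\lambda$, matching the exponential case.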

With these preliminaries, we now turn to structures that consist of more than 
one component. 


2.3 Canonical and Resilient Structures 

In this section, we consider some canonical structures, out of which more complex 
structures can be constructed. We start with the basic series and parallel structures,
continue with non-series/parallel ones, and then describe some of the many
resilient structures that incorporate redundant components (next referred to as
modules).

FIGURE 2.2 Series and parallel systems: (a) series system; (b) parallel system.


2.3.1 Series and Parallel Systems 

The most basic structures are the series and parallel systems depicted in Figure 2.2. 
A series system is defined as a set of N modules connected together so that the fail¬ 
ure of any one module causes the entire system to fail. Note that the diagram in 
Figure 2.2a is a reliability diagram and not always an electrical one; the output of 
the first module is not necessarily connected to the input of the second module. 
The four modules in this diagram can, for example, represent the instruction de¬ 
code unit, execution unit, data cache, and instruction cache in a microprocessor. 
All four units must be fault-free for the microprocessor to function, although the 
way they are connected does not resemble a series system. 

Assuming that the modules in Figure 2.2a fail independently of each other, the 
reliability of the entire series system is the product of the reliabilities of its $N$ modules.
Denoting by $R_i(t)$ the reliability of module $i$ and by $R_s(t)$ the reliability of the
whole system,

$$R_s(t) = \prod_{i=1}^{N} R_i(t) \tag{2.13}$$

If module $i$ has a constant failure rate, denoted by $\lambda_i$, then, according to
Equation 2.6, $R_i(t) = e^{-\lambda_i t}$, and consequently,

$$R_s(t) = e^{-\lambda_s t} \tag{2.14}$$

where $\lambda_s = \sum_{i=1}^{N} \lambda_i$. From Equation 2.14 we see that the series system has a constant
failure rate equal to $\lambda_s$ (the sum of the individual failure rates), and its MTTF
is therefore $\mathrm{MTTF}_s = 1/\lambda_s$.
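
The series-system formulas lend themselves to a short numerical check. The sketch below assumes independent module failures, as in the text; the four failure rates are hypothetical and the function names are ours.

```python
import math


def series_reliability(module_reliabilities):
    """Equation 2.13: product of the module reliabilities (independent failures)."""
    return math.prod(module_reliabilities)


def series_reliability_const(rates, t):
    """Equation 2.14: R_s(t) = exp(-lambda_s * t), lambda_s = sum of the module rates."""
    return math.exp(-sum(rates) * t)


def series_mttf(rates):
    """MTTF_s = 1 / lambda_s for constant module failure rates."""
    return 1.0 / sum(rates)


# Hypothetical failure rates (per hour) for four modules in series.
rates = [1e-4, 2e-4, 3e-4, 4e-4]
t = 1000.0
per_module = [math.exp(-lam * t) for lam in rates]
# The product form and the summed-rate form give the same reliability.
print(series_reliability(per_module), series_reliability_const(rates, t))
print(series_mttf(rates))
```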

A parallel system is defined as a set of N modules connected together so that 
it requires the failure of all the modules for the system to fail. This leads to the
following expression for the reliability of a parallel system, denoted by $R_p(t)$:

$$R_p(t) = 1 - \prod_{i=1}^{N} \left(1 - R_i(t)\right) \tag{2.15}$$

If module $i$ has a constant failure rate $\lambda_i$, then

$$R_p(t) = 1 - \prod_{i=1}^{N} \left(1 - e^{-\lambda_i t}\right) \tag{2.16}$$

As an example, the reliability of a parallel system consisting of two modules with
constant failure rates $\lambda_1$ and $\lambda_2$ is given by

$$R_p(t) = e^{-\lambda_1 t} + e^{-\lambda_2 t} - e^{-(\lambda_1 + \lambda_2) t}$$

Note that a parallel system does not have a constant failure rate; its failure rate
decreases with each failure of a module. It can be shown that the MTTF of a parallel
system with all its modules having the same failure rate $\lambda$ is $\mathrm{MTTF}_p = \sum_{k=1}^{N} \frac{1}{k\lambda}$.

FIGURE 2.3 A non-series/parallel system.
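
The parallel-system expressions can be exercised in the same way. This is a sketch under the text's independence assumption, with hypothetical failure rates and our own function names; it compares Equation 2.15 against the closed form for two modules and evaluates the MTTF sum for identical modules.

```python
import math


def parallel_reliability(module_reliabilities):
    """Equation 2.15: the system fails only if every module has failed."""
    return 1.0 - math.prod(1.0 - r for r in module_reliabilities)


def parallel_mttf_identical(n, lam):
    """MTTF of n parallel modules with a common constant failure rate lam."""
    return sum(1.0 / (k * lam) for k in range(1, n + 1))


# Two modules with hypothetical constant failure rates (per hour).
lam1, lam2, t = 1e-3, 2e-3, 500.0
general = parallel_reliability([math.exp(-lam1 * t), math.exp(-lam2 * t)])
closed_form = math.exp(-lam1 * t) + math.exp(-lam2 * t) - math.exp(-(lam1 + lam2) * t)
print(general, closed_form)

# Each added module contributes less MTTF than the previous one.
for n in (1, 2, 3, 4):
    print(n, parallel_mttf_identical(n, 1e-3))
```

The diminishing increments in the printed MTTF values reflect the harmonic sum: the k-th module adds only $1/(k\lambda)$.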

2.3.2 Non-Series/Parallel Systems 

Not all systems have a reliability diagram with a series/parallel structure. Fig¬ 
ure 2.3 depicts a non-series/parallel system whose reliability cannot be calculated 
using either Equation 2.13 or 2.15. Each path in Figure 2.3 represents a configu¬ 
ration that allows the system to operate successfully. For example, the path ADF 
means that the system operates successfully if all three modules A, D and F are 
fault-free. A path in such reliability diagrams is valid only if all modules and 
edges are traversed from left to right. The path BCDF in Figure 2.3 is thus invalid.
No graph transformations that may result in violations of this rule are allowed.

FIGURE 2.4 Expanding the diagram in Figure 2.3 about module C: (a) C not working; (b) C working.

In the following analysis, the dependence of the reliability on the time $t$ is omitted
for simplicity of notation, although it is implied that all reliabilities are functions
of $t$.

We calculate the reliability of the non-series/parallel system in Figure 2.3 by ex¬ 
panding about a single module i. That is, we condition on whether or not module 
i is functional, and use the Total Probability formula. 

$$R_{\text{system}} = R_i \cdot \mathrm{Prob}\{\text{System works} \mid i \text{ is fault-free}\} + (1 - R_i) \cdot \mathrm{Prob}\{\text{System works} \mid i \text{ is faulty}\} \tag{2.17}$$

where, as before, $R_i$ denotes the reliability of module $i$ ($i = A, B, C, D, E, F$). We can
now draw two new diagrams. In the first, module i will be assumed to be working, 
and in the second, module i will be faulty. Module i is selected so that the two new 
diagrams are as close as possible to simple series/parallel structures for which we 
can then use Equations 2.13 and 2.15. Selecting module C in Figure 2.3 results in 
the two diagrams in Figure 2.4. The process of expanding is then repeated until 
the resulting diagrams are of the series/parallel type. Figure 2.4a is already of 
the series/parallel type, whereas Figure 2.4b needs further expansion about E. 
Note that Figure 2.4b should not be viewed as a parallel connection of A and B, 
connected serially to D and E in parallel; such a diagram will have the path BCDF, 
which is not a valid path in Figure 2.3. Based on Figure 2.4 we can write, using 
Equation 2.17, 


$$R_{\text{system}} = R_C \cdot \mathrm{Prob}\{\text{System works} \mid C \text{ is fault-free}\} + (1 - R_C) R_F [1 - (1 - R_A R_D)(1 - R_B R_E)] \tag{2.18}$$

Expanding the diagram in Figure 2.4b about $E$ yields

$$\mathrm{Prob}\{\text{System works} \mid C \text{ is fault-free}\} = R_E R_F [1 - (1 - R_A)(1 - R_B)] + (1 - R_E) R_A R_D R_F$$

Substituting this last expression in Equation 2.18 results in

$$R_{\text{system}} = R_C [R_E R_F (R_A + R_B - R_A R_B) + (1 - R_E) R_A R_D R_F] + (1 - R_C) [R_F (R_A R_D + R_B R_E - R_A R_D R_B R_E)] \tag{2.19}$$

If $R_A = R_B = R_C = R_D = R_E = R_F = R$, then

$$R_{\text{system}} = R^3 (R^3 - 3R^2 + R + 2) \tag{2.20}$$
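
The result of the expansion can be cross-checked by brute force. The sketch below assumes that the valid paths of Figure 2.3 are ADF, BEF, and ACEF (the paths the text lists for this diagram) and that modules fail independently. It sums the probabilities of all 64 module states in which at least one path is fault-free and compares the total with Equation 2.19; the module reliabilities used are hypothetical.

```python
from itertools import product

MODULES = "ABCDEF"
PATHS = ("ADF", "BEF", "ACEF")  # valid left-to-right paths assumed for Figure 2.3


def exact_by_enumeration(rel):
    """Sum the probabilities of all module states in which some path is fault-free."""
    total = 0.0
    for state in product((True, False), repeat=len(MODULES)):
        up = dict(zip(MODULES, state))
        if any(all(up[m] for m in path) for path in PATHS):
            prob = 1.0
            for m in MODULES:
                prob *= rel[m] if up[m] else 1.0 - rel[m]
            total += prob
    return total


def equation_2_19(rel):
    """Closed form obtained by expanding about C and then about E."""
    a, b, c, d, e, f = (rel[m] for m in MODULES)
    c_good = e * f * (a + b - a * b) + (1 - e) * a * d * f
    c_bad = f * (a * d + b * e - a * d * b * e)
    return c * c_good + (1 - c) * c_bad


# Hypothetical, deliberately unequal module reliabilities.
rel = dict(zip(MODULES, (0.9, 0.8, 0.7, 0.95, 0.85, 0.99)))
print(exact_by_enumeration(rel), equation_2_19(rel))
```

Enumeration is only practical for small diagrams, since the number of states doubles with every module, which is one reason the expansion method and the bounds are useful.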

If the diagram of the non-series/parallel structure is too complicated to apply the
above procedure, upper and lower bounds on $R_{\text{system}}$ can be calculated instead.
An upper bound is given by

$$R_{\text{system}} \le 1 - \prod_i \left(1 - R_{\text{path}\,i}\right) \tag{2.21}$$

where $R_{\text{path}\,i}$ is the reliability of the series connection of the modules along path $i$.
The bound in Equation 2.21 assumes that all the paths are in parallel and that they 
are independent. In reality, two of these paths may have a module in common, 
and the failure of this module will result in both paths becoming faulty. That is 
why Equation 2.21 provides only an upper bound rather than an exact value. As 
an example, let us calculate the upper bound for Figure 2.3. The paths are ADF, 
BEF, and ACEF, resulting in 

$$R_{\text{system}} \le 1 - (1 - R_A R_D R_F)(1 - R_B R_E R_F)(1 - R_A R_C R_E R_F) \tag{2.22}$$

If $R_A = R_B = R_C = R_D = R_E = R_F = R$, then $R_{\text{system}} \le R^3 (R^7 - 2R^4 - R^3 + R + 2)$,
which is less accurate than the exact calculation in Equation 2.20.

The upper bound can be used to derive the exact reliability, by performing the 
multiplication in Equation 2.22 (or Equation 2.21 in the general case) and replacing 
every occurrence of $R_i^k$ ($k > 1$) by $R_i$. Since each module is used only once, its reliability
should not be raised to any power greater than 1. The reader is invited to verify 
that applying this rule to the upper bound in Equation 2.22 yields the same exact 
reliability as in Equation 2.19. 
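
The rule of multiplying out the upper bound and dropping repeated powers can be mechanized. In the sketch below (our own formulation, not taken from the text), each term of the expanded product corresponds to a subset of paths, and replacing powers of a module's reliability by the reliability itself amounts to taking the product over the union of the modules on those paths. The paths ADF, BEF, and ACEF are those listed above for Figure 2.3; the reliability value is hypothetical.

```python
from itertools import combinations
from math import prod

PATHS = ("ADF", "BEF", "ACEF")


def upper_bound(rel, paths=PATHS):
    """Equation 2.21: treat the paths as independent parallel branches."""
    return 1.0 - prod(1.0 - prod(rel[m] for m in p) for p in paths)


def exact_from_paths(rel, paths=PATHS):
    """Expand 1 - prod(1 - R_path) and apply the rule that powers of R_i become R_i.

    Each term of the expansion is a signed product of path reliabilities.
    Dropping repeated powers is the same as multiplying the reliabilities of
    the union of the modules on the chosen paths.
    """
    total = 0.0
    for k in range(1, len(paths) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for subset in combinations(paths, k):
            modules = set().union(*subset)
            total += sign * prod(rel[m] for m in modules)
    return total


r = 0.9  # hypothetical common module reliability
rel = {m: r for m in "ABCDEF"}
print(upper_bound(rel), r**3 * (r**7 - 2 * r**4 - r**3 + r + 2))
print(exact_from_paths(rel), r**3 * (r**3 - 3 * r**2 + r + 2))
```

The first line reproduces the equal-reliability form of the upper bound, and the second reproduces Equation 2.20.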

A lower bound can be calculated based on minimal cut sets of the system dia¬ 
gram, where a minimal cut set is a minimal list of modules such that the removal 
(due to faults) of all modules from the set will cause a working system to fail. The 
lower bound is obtained by

$$R_{\text{system}} \ge \prod_i \left(1 - Q_{\text{cut}\,i}\right) \tag{2.23}$$

where $Q_{\text{cut}\,i}$ is the probability that minimal cut $i$ is faulty. In Figure 2.3, the minimal
cut sets are $F$, $AB$, $AE$, $DE$, and $BCD$. Consequently,

$$R_{\text{system}} \ge R_F [1 - (1 - R_A)(1 - R_B)] [1 - (1 - R_A)(1 - R_E)] [1 - (1 - R_D)(1 - R_E)] [1 - (1 - R_B)(1 - R_C)(1 - R_D)] \tag{2.24}$$

If $R_A = R_B = R_C = R_D = R_E = R_F = R$, then $R_{\text{system}} \ge R^5 (24 - 60R + 62R^2 - 33R^3 + 9R^4 - R^5)$.
Figure 2.5 compares the upper and lower bounds to the exact system
reliability for the case in which all six modules have the same reliability $R$. Note
that in this case, for the more likely high values of $R$, the lower bound provides a
very good estimate for the system reliability.

FIGURE 2.5 Comparing the exact reliability of the non-series/parallel system in Figure 2.3
to its upper and lower bounds.
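
The comparison among the exact reliability and the two bounds can be tabulated numerically for equal module reliabilities. The following sketch uses the minimal cut sets named in the text and the closed forms for the exact value and the upper bound; the sample values of $R$ and the function names are our own choices.

```python
from math import prod

CUT_SETS = ("F", "AB", "AE", "DE", "BCD")  # minimal cut sets listed for Figure 2.3


def lower_bound(rel, cut_sets=CUT_SETS):
    """Equation 2.23: product over the minimal cut sets of (1 - Q_cut)."""
    return prod(1.0 - prod(1.0 - rel[m] for m in cut) for cut in cut_sets)


def exact_equal_modules(r):
    """Equation 2.20, valid when all six modules have reliability r."""
    return r**3 * (r**3 - 3 * r**2 + r + 2)


def upper_equal_modules(r):
    """Upper bound of Equation 2.22 when all six modules have reliability r."""
    return r**3 * (r**7 - 2 * r**4 - r**3 + r + 2)


print("R      lower     exact     upper")
for r in (0.5, 0.7, 0.9, 0.95, 0.99):
    rel = {m: r for m in "ABCDEF"}
    print(f"{r:<6} {lower_bound(rel):.6f}  {exact_equal_modules(r):.6f}  "
          f"{upper_equal_modules(r):.6f}")
```

In the printed table the lower bound approaches the exact value as $R$ grows, which is the behavior the text attributes to Figure 2.5.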

2.3.3 M-of-N Systems 

An M-of-N system is a system that consists of N modules and needs at least M of 
them for proper operation. Thus, the system fails when fewer than M modules are 
functional. The best-known example of this type of system is the triplex, which
consists of three identical modules whose outputs are voted on. This is a 2-of-3 
system: so long as a majority (2 or 3) of the modules produce correct results, the 
system will be functional. 

Let us now compute the reliability of an M-of-N system. We assume as before 
that the failures of the different modules are statistically independent and that 
there is no repair of failing modules. If $R(t)$ is the reliability of an individual module
(the probability that the module is still operational at time $t$), the reliability
of an M-of-N system is the probability that $M$ or more modules are functional at
time $t$. The system reliability is therefore given by

$$R_{M\_of\_N}(t) = \sum_{i=M}^{N} \binom{N}{i} R^i(t) [1 - R(t)]^{N-i} \tag{2.25}$$

where $\binom{N}{i} = \frac{N!}{(N-i)!\,i!}$. The assumption that failures are independent is key to the
high reliability of M-of-N systems. Even a slight extent of positively correlated
failures can greatly diminish their reliability. For example, suppose $q_{\text{cor}}$ is the probability
that the entire system suffers a common failure. The reliability of the system
now becomes

$$R_{M\_of\_N}(t) = (1 - q_{\text{cor}}) \sum_{i=M}^{N} \binom{N}{i} R^i(t) [1 - R(t)]^{N-i} \tag{2.26}$$

If the system is not designed carefully, the correlated failure factor can dominate
the overall failure probability.

FIGURE 2.6 A Triple Modular Redundant (TMR) structure.

In practice, correlated failure rates can be extremely difficult to estimate. In 
Equation 2.26, we assumed that there was a failure mode in which the entire clus¬ 
ter of N modules suffers a common failure. However, there are other modes as 
well, in which subsets of the N modules could undergo a correlated failure. There 
being $2^N - N - 1$ subsets containing two or more modules, it quickly becomes infeasible
to obtain by experiment or otherwise the correlated failure probabilities
associated with each of the subsets, even for moderate values of N. 
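
A short computation illustrates how quickly a common-mode failure can dominate. The sketch below implements Equations 2.25 and 2.26 directly; the module reliability and the value of $q_{\text{cor}}$ are hypothetical, and the function names are ours.

```python
from math import comb


def m_of_n_reliability(m, n, r):
    """Equation 2.25: probability that at least m of n independent modules work."""
    return sum(comb(n, i) * r**i * (1.0 - r) ** (n - i) for i in range(m, n + 1))


def m_of_n_reliability_correlated(m, n, r, q_cor):
    """Equation 2.26: the same sum, scaled by the chance of no common failure."""
    return (1.0 - q_cor) * m_of_n_reliability(m, n, r)


r = 0.999      # hypothetical module reliability
q_cor = 0.001  # hypothetical probability of a common failure of the whole system
independent = m_of_n_reliability(2, 3, r)
correlated = m_of_n_reliability_correlated(2, 3, r, q_cor)
# Failure probabilities without and with the common-mode term.
print(1.0 - independent, 1.0 - correlated)

# Number of subsets with two or more modules, 2**N - N - 1, for a few values of N.
for n in (3, 5, 10, 20):
    print(n, 2**n - n - 1)
```

With these assumed numbers the failure probability of the 2-of-3 system is governed almost entirely by $q_{\text{cor}}$ rather than by independent module failures.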

Perhaps the most important M-of-N system is the triplex, or the Triple Modular 
Redundant (TMR) cluster shown in Figure 2.6. In such a system, $M = 2$ and $N = 3$,
and a voter selects the majority output. If a single voter is used, that voter becomes
a critical point of failure and the reliability of the cluster is

$$R_{\text{TMR}}(t) = R_{\text{voter}}(t) \sum_{i=2}^{3} \binom{3}{i} R^i(t) [1 - R(t)]^{3-i} = R_{\text{voter}}(t) \left(3R^2(t)[1 - R(t)] + R^3(t)\right) = R_{\text{voter}}(t) \left(3R^2(t) - 2R^3(t)\right) \tag{2.27}$$

where $R_{\text{voter}}(t)$ is the reliability of the voter.

The general case of TMR is called N-modular redundancy (NMR) and is an
M-of-N cluster with $N$ odd and $M = \lceil N/2 \rceil$.

FIGURE 2.7 Comparing NMR reliability (for N = 3 and 5) to that of a single module
(voter failure rate is considered negligible).

In Figure 2.7, we plot the reliability of a simplex (a single module), a triplex 
(TMR), and an NMR cluster with $N = 5$. For high values of $R(t)$, the greater the
redundancy, the higher the system reliability. As $R(t)$ decreases, the advantages
of redundancy become less marked; until for $R(t) < 0.5$, redundancy actually becomes
a disadvantage, with the simplex being more reliable than either of the redundant
arrangements. This is also reflected in the value of $\mathrm{MTTF}_{\text{TMR}}$, which (for
$R_{\text{voter}}(t) = 1$ and $R(t) = e^{-\lambda t}$) can be calculated based on Equation 2.8 as

$$\mathrm{MTTF}_{\text{TMR}} = \int_0^\infty \left(3R^2(t) - 2R^3(t)\right) dt = \int_0^\infty \left(3e^{-2\lambda t} - 2e^{-3\lambda t}\right) dt = \frac{5}{6\lambda} < \frac{1}{\lambda} = \mathrm{MTTF}_{\text{Simplex}}$$

In most applications, however, $R(t) \gg 0.5$ for realistic $t$ and the system is repaired
or replaced long before $R(t) < 0.5$, so a triplex arrangement does offer significant
reliability gains. 
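
The crossover between the triplex and the simplex is easy to verify numerically. The sketch below assumes a perfect voter unless stated otherwise and a hypothetical constant failure rate; the crossover time follows from solving $e^{-\lambda t} = 0.5$, a step we add for illustration.

```python
import math


def tmr_reliability(r, r_voter=1.0):
    """Equation 2.27 with module reliability r and voter reliability r_voter."""
    return r_voter * (3.0 * r**2 - 2.0 * r**3)


def tmr_mttf(lam):
    """MTTF of a TMR cluster with a perfect voter: 3/(2*lam) - 2/(3*lam) = 5/(6*lam)."""
    return 3.0 / (2.0 * lam) - 2.0 / (3.0 * lam)


def simplex_mttf(lam):
    """MTTF of a single module with constant failure rate lam."""
    return 1.0 / lam


def crossover_time(lam):
    """Time at which R(t) = exp(-lam * t) falls to 0.5, i.e. ln(2) / lam."""
    return math.log(2.0) / lam


for r in (0.99, 0.9, 0.5, 0.4):
    # The triplex is better than the simplex above r = 0.5 and worse below it.
    print(r, tmr_reliability(r))

lam = 1e-3  # hypothetical failure rate, per hour
print(tmr_mttf(lam), simplex_mttf(lam), crossover_time(lam))
```

The MTTF comparison is less favorable to TMR than the reliability comparison at short mission times because the integral in Equation 2.8 also counts the long tail in which $R(t) < 0.5$.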

Equation 2.27 was derived under the conservative assumption that every fail¬ 
ure of the voter will lead to erroneous system output and that any failure of two 
modules is fatal. This is not necessarily the case. If, for example, one module has 
a permanent logical 1 on one of its outputs and a second module has a perma¬ 
nent logical 0 on its corresponding output, the TMR (or NMR) will still function 
properly. Clearly, a similar situation may arise regarding certain faults within the 
voter circuit. These are examples of compensating faults. Another case of faults 
that may be harmless are non-overlapping faults. For example, one module may 
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have a faulty adder and another module a faulty multiplier. If the adder and mul¬ 
tiplier circuits are disjoint, the two faulty modules are unlikely to generate wrong 
outputs simultaneously If all compensating and non-overlapping faults are taken 
into account, the resulting reliability will be higher than that predicted by Equa¬ 
tion 2.27. 

2.3.4 Voters 

A voter receives inputs $x_1, x_2, \ldots, x_N$ from an M-of-N cluster and generates a
representative output. The simplest voter is one that does a bit-by-bit comparison of the
outputs, and checks if a majority of the N inputs are identical. If so, it outputs the 
majority. This approach only works when we can guarantee that every functional 
module will generate an output that matches the output of every other functional 
module, bit by bit. This will be the case if the modules are identical processors, use 
identical inputs and identical software, and have mutually synchronized clocks. 

If, however, the modules are different processors or are running different soft¬ 
ware for the same problem, it is possible for two correct outputs to diverge slightly, 
in the lower significant bits. Hence, we can declare two outputs $x$ and $y$ as practically
identical if $|x - y| < \delta$ for some specified $\delta$. (Note that "practically identical"
is not transitive; if A is practically identical to B and B is practically identical to C, 
this does not necessarily mean that A is practically identical to C.) 

For such approximate agreement, we can do plurality voting. A k-plurality voter 
looks for a set of at least k practically identical outputs (this is a set in which each 
member is practically identical to all other members) and picks any of them (or the 
median) as the representative. For example, if we set $\delta = 0.1$ and the five outputs
were 1.10, 1.11, 1.32, 1.49, 3.00, then the subset {1.10, 1.11} would be selected by a
2-plurality voter. 
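
One possible way to implement such a voter for scalar outputs is sketched below. It is an illustration under our own assumptions rather than a prescribed design: the outputs are sorted, and since every pair in a group must differ by less than $\delta$, a valid group is a run of sorted values whose largest and smallest members differ by less than $\delta$. The function name and the choice to return the largest such group are ours.

```python
def plurality_vote(outputs, k, delta):
    """Return the largest group of mutually 'practically identical' outputs.

    Every pair in the returned group differs by less than delta. If no group
    has at least k members, None is returned.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    values = sorted(outputs)
    best = []
    start = 0
    for end in range(len(values)):
        # Shrink the window until its extreme values differ by less than delta.
        while values[end] - values[start] >= delta:
            start += 1
        if end - start + 1 > len(best):
            best = values[start:end + 1]
    return best if len(best) >= k else None


outputs = [1.10, 1.11, 1.32, 1.49, 3.00]
print(plurality_vote(outputs, k=2, delta=0.1))
print(plurality_vote(outputs, k=3, delta=0.1))
```

A representative value can then be taken as any member of the returned group, or its median, as described above.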

In our discussion so far, we have implicitly assumed that each output has an 
equal chance of being faulty. In some cases that may not be true; the hardware (or 
software) producing one output may have a different failure probability than does 
the hardware (or software) producing another output. In this case, each output is 
assigned a weight that is related to its probability of being correct. The voter then 
does weighted voting and produces an output that is associated with over half the 
sum of all weights. 
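
A weighted voter for exactly matching outputs can be sketched as follows. The weights used here are hypothetical; the text only requires that each weight be related to the probability that the corresponding output is correct, and does not specify how the weights are derived.

```python
def weighted_vote(outputs, weights):
    """Return the output backed by more than half of the total weight, else None."""
    totals = {}
    for value, weight in zip(outputs, weights):
        totals[value] = totals.get(value, 0.0) + weight
    half = sum(weights) / 2.0
    for value, total in totals.items():
        if total > half:
            return value
    return None


# One highly trusted module outvotes two less trusted ones (hypothetical weights).
print(weighted_vote([1, 0, 0], [0.6, 0.2, 0.1]))
# No output holds more than half the total weight, so the vote is inconclusive.
print(weighted_vote(["a", "b"], [0.5, 0.5]))
```

Unlike simple majority voting, the outcome here can be decided by a minority of the modules if their combined weight is large enough.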

2.3.5 Variations on N-Modular Redundancy

Unit-Level Modular Redundancy 

In addition to applying replication and voting at the level of the entire system, the 
same idea can be applied at the subsystem level as well. Figure 2.8 shows how 
triple-modular replication can be applied at the individual unit level for a system 
consisting of four units. In such a scheme, the voters are no longer as critical as in 
NMR. A single faulty voter will cause no more harm than a single faulty unit, and
the effect of either one will not propagate beyond the next level of units. Clearly,
the level at which replication and voting are applied can be further lowered at the
expense of additional voters, increasing the overall size and delay of the system.

FIGURE 2.9 Triplicated voters in a processor/memory TMR.

Of particular interest is the triplicated processor/memory system shown in 
Figure 2.9. Here, all communications (in either direction) between the tripli¬ 
cated processors and triplicated memories go through majority voting. This or¬ 
ganization is more reliable than a single majority voting of a triplicated proces¬ 
sor/memory structure. 

Dynamic Redundancy 

The above variations of NMR employ considerable amounts of hardware in order 
to instantaneously mask errors that may occur during the operation of the sys¬ 
tem. However, in many applications, temporary erroneous results are acceptable 
as long as the system is capable of detecting such errors and reconfiguring itself by 
replacing the faulty module with a fault-free spare module. An example of such 
a dynamic (or active) redundancy scheme is depicted in Figure 2.10, in which the
system consists of one active module, N spare modules, and a Fault Detection and
Reconfiguration unit that is assumed to be capable of detecting any erroneous output
produced by the active module, disconnecting the faulty active module, and
connecting instead a fault-free spare (if one exists).

FIGURE 2.10 Dynamic redundancy.

Note that if all the spare modules are active (powered), we expect them to have 
the same failure rate as the single active module. This dynamic redundancy struc¬ 
ture is, therefore, similar to the basic parallel system in Figure 2.2, and its reliability 
is given by 

^dynamic (f) = R d ru(0(l ~ [l ~ *(*)] N+1 ) (2.28) 

where R(t) is the reliability of each module, and R c j ru (f) is the reliability of the 
detection and reconfiguration unit. If, however, the spare modules are not pow¬ 
ered (in order to conserve energy), they may have a negligible failure rate when 
not in operation. Denoting by c the coverage factor, defined as the probability that 
the faulty active module will be correctly diagnosed and disconnected and a good 
spare will be successfully connected, we can derive the system reliability for very 
large N by arguing as follows. 

Failures to the active module occur at rate $\lambda$. The probability that a given such
failure cannot be recovered from is $1 - c$. Hence, the rate at which unrecoverable
failures occur is $(1 - c)\lambda$. The probability that no unrecoverable failure occurs to
the active processor over a duration $t$ is therefore given by $e^{-(1-c)\lambda t}$; the reliability
of the reconfiguration unit is given by $R_{dru}(t)$. We therefore have:

$$R_{dynamic}(t) = R_{dru}(t)\,e^{-(1-c)\lambda t} \tag{2.29}$$
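
The two expressions for dynamic redundancy (Equations 2.28 and 2.29) are easy to evaluate numerically. The following is a minimal sketch; the function and parameter names are our own choices for illustration, and the detection and reconfiguration unit is assumed ideal unless a value for its reliability is supplied.

```python
import math


def r_dynamic_powered(r_module: float, r_dru: float, n_spares: int) -> float:
    """Equation 2.28: all N spares are powered, so the active module and
    the N spares behave like N + 1 modules in parallel."""
    return r_dru * (1.0 - (1.0 - r_module) ** (n_spares + 1))


def r_dynamic_unpowered(lam: float, c: float, t: float, r_dru: float = 1.0) -> float:
    """Equation 2.29: unpowered spares (negligible failure rate), a very
    large number of spares, and coverage factor c."""
    return r_dru * math.exp(-(1.0 - c) * lam * t)


if __name__ == "__main__":
    # Illustrative values only: lambda = 0.001 per hour, t = 1000 hours.
    print(r_dynamic_powered(r_module=0.9, r_dru=1.0, n_spares=2))
    print(r_dynamic_unpowered(lam=0.001, c=0.9, t=1000.0))
```

With perfect coverage ($c = 1$) the second expression reduces to $R_{dru}(t)$, which reflects the assumption of an effectively unlimited supply of spares.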


Hybrid Redundancy 

An NMR system is capable of masking permanent and intermittent failures, but
as we have seen, its reliability drops below that of a single module for very long
mission times if no repair or replacement takes place. The objective of hybrid
redundancy is to overcome this by adding spare modules that will be used
to replace active modules once they become faulty. Figure 2.11 depicts a hybrid
system consisting of a core of N processors constituting an NMR, and a set of K
spares. The outputs of the active primary modules are compared (by the Compare
unit) to the output of the voter to identify a faulty primary (if any). The Compare
unit then generates the corresponding disagreement signal, which will cause the
Reconfiguration unit to disconnect the faulty primary and connect a spare module
instead.

FIGURE 2.11 Hybrid redundancy.

The reliability of a hybrid system with a TMR core and K spares is

$$R_{hybrid}(t) = R_{voter}(t)\,R_{rec}(t)\left(1 - mR(t)[1 - R(t)]^{m-1} - [1 - R(t)]^{m}\right) \tag{2.30}$$

where $m = K + 3$ is the total number of modules, and $R_{voter}(t)$ and $R_{rec}(t)$ are the
reliability of the voter and the comparison and reconfiguration circuitry, respectively.
Equation 2.30 assumes that any fault in either the voter or the comparison
and reconfiguration circuit will cause a system failure. In practice, not all faults
in these circuits are fatal, and the reliability of the hybrid system will be higher
than what is predicted by Equation 2.30. A more accurate value of $R_{hybrid}(t)$ can
be obtained through a detailed analysis of the voter and the comparison and reconfiguration
circuits and the different ways in which they can fail.
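
The term in parentheses in Equation 2.30 is the probability that at least two of the $m$ modules are fault-free. The sketch below evaluates the equation and cross-checks that term against a direct binomial sum; the function names are ours, and the voter and reconfiguration circuitry are assumed ideal by default.

```python
from math import comb


def r_hybrid(r: float, k_spares: int, r_voter: float = 1.0, r_rec: float = 1.0) -> float:
    """Equation 2.30 for a TMR core with K spares (m = K + 3 modules)."""
    m = k_spares + 3
    at_least_two_good = 1.0 - m * r * (1.0 - r) ** (m - 1) - (1.0 - r) ** m
    return r_voter * r_rec * at_least_two_good


def at_least_two_good_binomial(r: float, m: int) -> float:
    """Probability that 2 or more of m independent modules are fault-free."""
    return sum(comb(m, i) * r**i * (1.0 - r) ** (m - i) for i in range(2, m + 1))


if __name__ == "__main__":
    for k in (0, 1, 2, 4):
        print(k, r_hybrid(0.9, k), at_least_two_good_binomial(0.9, k + 3))
```

For $K = 0$ the expression reduces to the familiar TMR reliability $3R^2(t) - 2R^3(t)$ (with an ideal voter).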

Sift-Out Modular Redundancy 

As in NMR, all N modules in the Sift-out Modular Redundancy scheme (see Figure 2.12)
are active, and the system is operational as long as there are at least two
fault-free modules. Unlike NMR, this system uses comparator, detector, and collector
circuits instead of a majority voter. The comparator compares the outputs
of all pairs of modules, so that $E_{ij} = 1$ if the outputs of modules $i$ and $j$ do not
match. Based on these signals, the detector determines which modules are faulty
and generates the logical outputs $F_1, F_2, \ldots, F_N$, where $F_i = 1$ if module $i$ has been
determined to be faulty and 0 otherwise. Finally, the collector unit produces the
system output, which is the OR of the outputs of all fault-free modules. This way, a
module whose output disagrees with the outputs of the other modules is switched
out and no longer contributes to the system output. The implementation of this
scheme is simpler than that of hybrid redundancy.

FIGURE 2.12 Sift-out structure.

FIGURE 2.13 Duplex system.

Care must be taken, however, not to be too aggressive in the purging (sifting- 
out) process. The vast majority of failures tend to be transient and disappear on 
their own after some time. It is preferable, therefore, to only purge a module if it 
produces incorrect outputs over a sustained interval of time. 
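
The comparator, detector, and collector can be sketched in software as follows. This is only an illustration under assumptions of our own: the detector rule (a module is suspected when it disagrees with every other active module, applied only while at least three modules are active), the `threshold` on consecutive disagreements before purging, and all names are hypothetical and are not taken from a specific sift-out design.

```python
def sift_out_step(outputs, purged, strikes, threshold=3):
    """One comparison cycle of a sift-out scheme (illustrative sketch).

    outputs   -- list of integer output words, one per module
    purged    -- set of module indices already switched out (updated in place)
    strikes   -- dict: module index -> consecutive cycles in disagreement
    threshold -- consecutive disagreements required before purging (assumed)
    Returns the system output: the OR of the outputs of unsuspected modules.
    """
    active = [i for i in range(len(outputs)) if i not in purged]

    # Comparator + detector: E_ij = 1 when outputs[i] != outputs[j];
    # assumed rule: F_i = 1 when module i disagrees with all other active modules.
    suspects = set()
    if len(active) >= 3:
        for i in active:
            if all(outputs[i] != outputs[j] for j in active if j != i):
                suspects.add(i)

    # Purge only after a sustained run of disagreements, so that a
    # transient fault does not permanently remove a module.
    for i in active:
        strikes[i] = strikes.get(i, 0) + 1 if i in suspects else 0
        if strikes[i] >= threshold:
            purged.add(i)

    # Collector: OR of the outputs of modules not suspected in this cycle.
    result = 0
    for i in active:
        if i not in suspects:
            result |= outputs[i]
    return result
```

Because the fault-free modules produce identical outputs, OR-ing them yields that common value; the strike counter implements the caution above about not purging on a single disagreement.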

2.3.6 Duplex Systems 

A duplex system is the simplest example of module redundancy. Figure 2.13 
shows an example of a duplex system consisting of two processors and a comparator.
Both processors execute the same task, and if the comparator finds that
their outputs are in agreement, the result is assumed to be correct. The implicit 
assumption here is that it is highly unlikely for both processors to suffer identical 
hardware failures that result in their both producing identical wrong results. If, on 
the other hand, the results are different, there is a fault, and higher-level software 
has to decide how it is to be handled. 

The fact that the two processors disagree does not, by itself, allow us to identify 
the failed processor. This can be done using one of several methods, some of which 
we will consider below. To derive the reliability of the duplex system we denote, 
as before, by c the coverage factor, which is the probability that a faulty processor 
will be correctly diagnosed, identified, and disconnected. 

Assuming that the two processors are identical, each with a reliability $R(t)$, the
reliability of the duplex system is

$$R_{duplex}(t) = R_{comp}(t)\left(R^2(t) + 2cR(t)[1 - R(t)]\right) \tag{2.31}$$

where $R_{comp}$ is the reliability of the comparator. Assuming a fixed failure rate of $\lambda$
for each processor and an ideal comparator ($R_{comp}(t) = 1$), the MTTF of the duplex
system is

$$MTTF_{duplex} = \frac{1}{2\lambda} + \frac{c}{\lambda}$$

The main difference between a duplex and a TMR system is that in a duplex, the 
faulty processor must be identified. We discuss next the various ways in which 
this can be done. 
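
The MTTF expression follows from integrating Equation 2.31 over $[0, \infty)$ with $R(t) = e^{-\lambda t}$. As a sanity check, the sketch below compares the closed form with a simple trapezoidal integration; the function names, the step count, and the truncation of the integral at 40 module lifetimes are our own choices.

```python
import math


def r_duplex(t: float, lam: float, c: float, r_comp: float = 1.0) -> float:
    """Equation 2.31 with R(t) = exp(-lam * t)."""
    r = math.exp(-lam * t)
    return r_comp * (r * r + 2.0 * c * r * (1.0 - r))


def mttf_duplex_closed_form(lam: float, c: float) -> float:
    """MTTF of the duplex with an ideal comparator: 1/(2*lam) + c/lam."""
    return 1.0 / (2.0 * lam) + c / lam


def mttf_duplex_numeric(lam: float, c: float, steps: int = 100_000) -> float:
    """Trapezoidal estimate of the integral of R_duplex(t) from 0 to 40/lam."""
    t_max = 40.0 / lam
    h = t_max / steps
    total = 0.5 * (r_duplex(0.0, lam, c) + r_duplex(t_max, lam, c))
    for i in range(1, steps):
        total += r_duplex(i * h, lam, c)
    return total * h


if __name__ == "__main__":
    lam, c = 0.001, 0.9  # illustrative values
    print(mttf_duplex_closed_form(lam, c), mttf_duplex_numeric(lam, c))
```

Note that for $c = 0$ the MTTF is $1/(2\lambda)$, half that of a single processor, since the first failure of either processor then brings down the duplex.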

Acceptance Tests 

The first method for identifying the faulty processor is to carry out a check of 
each processor's output and is known as an acceptance test. One example of an 
acceptance test is a range test, which checks if the output is in the expected range. 
This is a basic and simple test, which usually works very well. For example, if the 
output of a processor is supposed to indicate the predicted pressure in a container 
(for gases or liquids), we would know the range of pressures that the container 
can hold. Any output outside those values results in the output being flagged as 
faulty. We are therefore using semantic information of the task to predict which 
values of output indicate an error. 

The question is now how to determine the range of acceptable values. The narrower
this range, the greater the probability that an incorrect output will be identified
as such, but the greater, too, the probability that a correct output will be misidentified as
erroneous. We define the sensitivity of a test as the conditional probability that the
test detects an error given that the output is actually erroneous, and the specificity
of a test as the conditional probability that the output is erroneous, given that the
acceptance test declares an error. A narrow range acceptance test will have high
sensitivity but low specificity, which means that the test is very likely not to miss an
erroneous output but at the same time it is likely to get many false-positive results
(correct results that the test declares faulty).

The reverse happens when we make this range very wide: then we have low
sensitivity but high specificity. We will consider this problem again when we discuss
recovery block approaches to software fault tolerance in Chapter 5.

Range tests are the simplest, but by no means the only, acceptance test mechanism.
Any other test that can discriminate reasonably accurately between a correct
and an incorrect output can be used. For instance, suppose we want to check the
correctness of a square-root operation; since $(\sqrt{x})^2 = x$, we can square the output
and check if it is the same as the input (or sufficiently close, depending on the level
of precision used).
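
Both kinds of acceptance test can be sketched in a few lines. The names, the pressure limits in the usage example, and the tolerances are hypothetical; in practice they would be chosen from knowledge of the application and of the arithmetic precision in use.

```python
import math


def range_test(value: float, low: float, high: float) -> bool:
    """Range test: accept the output only if it lies in [low, high]."""
    return low <= value <= high


def sqrt_acceptance_test(x: float, claimed_root: float, rel_tol: float = 1e-9) -> bool:
    """Accept claimed_root as the square root of x if its square is
    sufficiently close to x (tolerance is an assumed parameter)."""
    if claimed_root < 0:
        return False
    return math.isclose(claimed_root * claimed_root, x, rel_tol=rel_tol, abs_tol=1e-12)


if __name__ == "__main__":
    # Hypothetical container rated for pressures between 0 and 10 units.
    print(range_test(3.2, 0.0, 10.0), range_test(12.0, 0.0, 10.0))
    print(sqrt_acceptance_test(2.0, math.sqrt(2.0)), sqrt_acceptance_test(2.0, 1.5))
```

Widening `[low, high]` or loosening `rel_tol` trades sensitivity for fewer false positives, which is the trade-off discussed above.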

Hardware Testing

The second method of identifying the failed processor is to subject both processors 
to some hardware/logic test routines. Such diagnostic tests are regularly used to 
verify that the processor circuitry is functioning properly, but running them can 
identify the processor that produced the erroneous output only if a permanent 
fault is present in that processor. Since most hardware faults are transient, hardware
testing has a low probability of identifying the processor that failed to produce
the correct output.

Even if the hardware fault is permanent, running hardware tests does not guarantee
that the fault will be detected. In practice, hardware tests are never perfect,
and there is a non-zero probability that a processor that is actually faulty passes
the test. The test sensitivity, or the probability of the test identifying a faulty
processor as such, is in the case of hardware tests often called the test coverage.

Forward Recovery 

A third method for identifying the faulty processor in a duplex is to use a third 
processor to repeat the computations carried out by the duplex. If only one of the 
three processors (the duplex plus this new processor) is faulty, then whichever 
processor the third disagrees with is the faulty one. 
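
The decision rule can be written down directly. The sketch below assumes, as in the text, that at most one of the three processors is faulty; the function name and the return labels are our own.

```python
def identify_faulty(out_a, out_b, out_third):
    """Identify the faulty member of a duplex (A, B) using a third
    processor's result. Assumes at most one of the three is faulty.

    Returns None if A and B agree, "A" or "B" for the faulty processor,
    or "undetermined" if the third result matches neither (which violates
    the single-fault assumption).
    """
    if out_a == out_b:
        return None
    if out_third == out_a:
        return "B"
    if out_third == out_b:
        return "A"
    return "undetermined"
```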

It is also possible to use a combination of these methods. The acceptance test is 
the quickest to run but is often the least sensitive. The result of the acceptance test 
can be used as a provisional indication of which processor is faulty, and this can 
be confirmed by using either of the other two approaches. 

Pair-and-Spare System 

Several more complicated resilient structures have been proposed that use the duplex
as their building block. The first such system that we describe is the
pair-and-spare system (see Figure 2.14), in which modules are grouped in pairs, and
each pair has a comparator that checks if the two outputs are equal (or sufficiently 
close). If the outputs of the two primary modules do not match, this indicates an 
error in at least one of them but does not indicate which one is in error. Running 
diagnostic tests, as described in the previous section, will result in a disruption in 
service. To avoid such a disruption, the entire pair is disconnected and the computation
is transferred to a spare pair. The two members of the switched-out pair
can now be tested offline to determine whether the error was due to a transient or 
permanent fault. In the case of a transient fault, the pair can eventually be marked 
as a good spare. 

Triplex-Duplex System 

Another duplex-based structure is the triplex-duplex system. Here, processors
are tied together to form duplexes, and then, a triplex is formed out of these duplexes.
When the processors in a duplex disagree, both of them are switched out of
the system. The triplex-duplex arrangement allows for the error masking of voting
combined with a simpler identification of faulty processors. Furthermore, the
triplex can continue to function even if only one duplex is left functional, because
the duplex arrangement allows the detection of faults. Deriving the reliability of a
triplex-duplex system is reasonably simple and is left for the reader as an exercise.

FIGURE 2.14 A pair-and-spare structure consisting of two duplexes.


2.4 Other Reliability Evaluation Techniques 

Most of the structures that we have described so far have been simple enough to 
allow reliability derivations using straightforward, and relatively simple, combinatorial
arguments. Analysis of more complex resilient structures requires more
advanced reliability evaluation techniques, some of which are described next. 


2.4.1 Poisson Processes 

Consider non-deterministic events of some sort, occurring over time with the following
probabilistic behavior:

For a time interval of very short length $\Delta t$,

1. The probability of one event occurring during the interval $\Delta t$ is, for some constant
$\lambda$, $\lambda \Delta t$ plus terms of order $\Delta t^2$.

2. The probability of more than one event occurring during $\Delta t$ is negligible (of
the order of $\Delta t^2$).

3. The probability of no events occurring during the interval $\Delta t$ is $1 - \lambda \Delta t$ plus
terms of order $\Delta t^2$.

Let $N(t)$ denote the number of events occurring in an interval of length $t$, and let
$P_k(t) = \text{Prob}\{N(t) = k\}$ be the probability of exactly $k$ events occurring during an
interval of length $t$ ($k = 0, 1, 2, \ldots$). Based on (1)-(3), we have

$$P_k(t + \Delta t) \approx P_{k-1}(t)\,\lambda \Delta t + P_k(t)(1 - \lambda \Delta t) \quad (\text{for } k = 1, 2, \ldots)$$

and

$$P_0(t + \Delta t) \approx P_0(t)(1 - \lambda \Delta t)$$

These approximations become more accurate as $\Delta t \to 0$, and lead to the differential
equations:

$$\frac{dP_k(t)}{dt} = \lambda P_{k-1}(t) - \lambda P_k(t) \quad (\text{for } k = 1, 2, \ldots)$$

and

$$\frac{dP_0(t)}{dt} = -\lambda P_0(t)$$

Using the initial condition $P_0(0) = 1$, the solution to this set of differential equations
is

$$P_k(t) = \text{Prob}\{N(t) = k\} = e^{-\lambda t}\frac{(\lambda t)^k}{k!} \quad (\text{for } k = 0, 1, 2, \ldots)$$


A process $N(t)$ with this probability distribution is called a Poisson process with
rate $\lambda$. A Poisson process with rate $\lambda$ has the following properties:

1. The expected number of events occurring in an interval of length $t$ is $\lambda t$.

2. The length of time between consecutive events is an exponentially distributed
random variable with parameter $\lambda$ and mean value $1/\lambda$.

3. The numbers of events occurring in disjoint intervals of time are independent
of one another.

4. The sum of two independent Poisson processes with rates $\lambda_1$ and $\lambda_2$ is itself a
Poisson process with rate $\lambda_1 + \lambda_2$.
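
The Poisson probabilities and the first property above can be checked numerically. The sketch below is illustrative only: the rate and interval are arbitrary, and the infinite sums are truncated at 60 terms, which is more than enough when $\lambda t = 2$.

```python
import math


def poisson_pmf(k: int, lam: float, t: float) -> float:
    """Prob{N(t) = k} for a Poisson process with rate lam."""
    return math.exp(-lam * t) * (lam * t) ** k / math.factorial(k)


if __name__ == "__main__":
    lam, t = 0.5, 4.0  # illustrative values, so lam * t = 2
    probs = [poisson_pmf(k, lam, t) for k in range(60)]
    print("sum of probabilities:", sum(probs))
    print("expected number of events:", sum(k * p for k, p in enumerate(probs)))
    print("Prob{no event in [0, t]}:", probs[0], "vs exp(-lam*t):", math.exp(-lam * t))
```

The last line ties in with the second property: the probability that the time to the first event exceeds $t$ is $P_0(t) = e^{-\lambda t}$, the tail of an exponential distribution with parameter $\lambda$.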


As an example for the use of the Poisson process we consider a duplex system,
consisting of two active identical processors with an unlimited number of spares.
The two active processors are subject to failures occurring at a constant rate of
$\lambda$ per processor. The spares, however, are assumed to always be functional (they
have a negligible failure rate so long as they are not active).

When a failure occurs in an active processor, it must be detected and a new
processor inducted into the duplex to replace the one that just failed. As before,
we define the coverage factor $c$ as the probability of successful detection and induction.
We, however, assume for simplicity that the comparator failure rate is
negligible and that the induction process of a new processor is instantaneous.

Let us now calculate the reliability of this duplex system over the time interval
$[0, t]$. We first concentrate on the failure process in one of the two processors. When
a processor fails (due to a permanent fault), it is diagnosed and replaced instantaneously.
Due to the constant failure rate $\lambda$, the time between two consecutive
failures of the same processor is exponentially distributed with parameter $\lambda$. This
implies that $N(t)$, the number of failures that occur in this one processor during
the time interval $[0, t]$, is a Poisson process with the rate $\lambda$.

Since the duplex has two active processors, the number of failures that occur in
the duplex is the sum of the numbers of failures of the two processors, and hence,
it is also a Poisson process (denoted by $M(t)$) with rate $2\lambda$. The probability that $k$
failures occur in the duplex over an interval of duration $t$ is thus

$$\text{Prob}\{k \text{ failures in duplex}\} = \text{Prob}\{M(t) = k\} = e^{-2\lambda t}\frac{(2\lambda t)^k}{k!} \tag{2.32}$$

For the duplex system not to fail, each of these failures must be detected and the
processor successfully replaced. The probability of one such success is the coverage
factor $c$, and the probability that the system will survive $k$ failures is $c^k$. The
reliability of the duplex over the interval $[0, t]$ is therefore

$$\begin{aligned}
R_{duplex}(t) &= \sum_{k=0}^{\infty} \text{Prob}\{k \text{ failures in duplex}\} \cdot c^k \\
&= \sum_{k=0}^{\infty} e^{-2\lambda t}\frac{(2\lambda t)^k c^k}{k!} \\
&= e^{-2\lambda t}\sum_{k=0}^{\infty}\frac{(2\lambda t c)^k}{k!} = e^{-2\lambda t}\,e^{2\lambda t c} \\
&= e^{-2\lambda(1-c)t}
\end{aligned} \tag{2.33}$$

In our derivation, we have used the fact that

$$e^x = 1 + x + \frac{x^2}{2!} + \cdots = \sum_{k=0}^{\infty}\frac{x^k}{k!}$$


We could have obtained the expression in 2.33 more directly using the type of
reasoning we employed in the analysis of dynamic redundancy (Equation 2.29).
To reiterate, the steps are as follows:

1. Individual processors fail at a rate $\lambda$, and so processor failures occur in the
duplex at the rate $2\lambda$.

2. Each processor failure has a probability $c$ of being successfully dealt with, and
a probability $1 - c$ of causing failure to the duplex.

3. As a result, failures that crash the duplex occur with rate $2\lambda(1 - c)$.

4. The reliability of the system is thus $e^{-2\lambda(1-c)t}$.

Similar derivations can be made for M-of-N systems in which failing processors
are identified and replaced from an infinite pool of spares. This is left for the reader 
as an exercise. The extension to the case with only a finite set of spares is simple: 
the summation in the reliability expression is capped at that number of spares, 
rather than going to infinity. 
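
As a numerical illustration, the sketch below sums the series for the duplex reliability term by term and compares it with the closed form $e^{-2\lambda(1-c)t}$. The optional `n_spares` argument caps the summation as described above for a finite set of spares; the function name, the parameter names, and the 200-term truncation used for the "infinite" case are our own choices.

```python
import math


def r_duplex_with_spares(lam, c, t, n_spares=None, max_terms=200):
    """Sum of Prob{k failures in duplex} * c**k for k = 0, 1, 2, ...

    n_spares=None approximates the unlimited-spares case by summing
    max_terms terms; otherwise the sum is capped at k = n_spares.
    """
    upper = max_terms if n_spares is None else n_spares
    x = 2.0 * lam * t
    term = math.exp(-x)          # k = 0 term: exp(-2*lam*t)
    total = term
    for k in range(1, upper + 1):
        term *= x * c / k        # builds exp(-x) * (x*c)**k / k! incrementally
        total += term
    return total


if __name__ == "__main__":
    lam, c, t = 0.01, 0.95, 100.0  # illustrative values
    print("series     :", r_duplex_with_spares(lam, c, t))
    print("closed form:", math.exp(-2.0 * lam * (1.0 - c) * t))
    print("3 spares   :", r_duplex_with_spares(lam, c, t, n_spares=3))
```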

2.4.2 Markov Models 

In complex systems in which constant failure rates are assumed but combinatorial
arguments are insufficient for analyzing the reliability of the system, we can
use Markov models for deriving expressions for the system reliability. In addition, 
Markov models provide a structured approach for the derivation of reliabilities of 
systems that may include coverage factors and a repair process. 

A Markov chain is a special type of a stochastic process. In general, a stochastic
process $X(t)$ is an infinite number of random variables, indexed by time $t$. Consider
now a stochastic process $X(t)$ that must take values from a set (called the state space)
of discrete quantities, say the integers 0, 1, 2, .... The process $X(t)$ is called a Markov
chain if

$$\text{Prob}\{X(t_n) = j \mid X(t_0) = i_0, X(t_1) = i_1, \ldots, X(t_{n-1}) = i_{n-1}\} = \text{Prob}\{X(t_n) = j \mid X(t_{n-1}) = i_{n-1}\}$$

for every $t_0 < t_1 < \cdots < t_{n-1} < t_n$.

If $X(t) = i$ for some $t$ and $i$, we say that the chain is in state $i$ at time $t$. We will
deal only with continuous time, discrete state Markov chains, for which the time
$t$ is continuous ($0 \le t < \infty$) but the state $X(t)$ is discrete and integer valued. For
convenience, we will use as states the integers 0, 1, 2, .... The Markov property implies
that in order to predict the future trajectory of a Markov chain, it is sufficient
to know its present state. This freedom from the need to store the entire history
of the process is of great practical importance: it makes the problem of analyzing
Markovian stochastic processes tractable in many cases.

The probabilistic behavior of a Markov chain can be described as follows. Once
it moves into some state $i$, it stays there for a length of time that has an exponential
distribution with parameter $\lambda_i$. This implies a constant rate $\lambda_i$ of leaving state $i$.
The probability that, when leaving state $i$, the chain will move to state $j$ (with $j \ne i$)
is denoted by $p_{ij}$ ($\sum_{j \ne i} p_{ij} = 1$). The rate of transition from state $i$ to state $j$ is thus

$$\lambda_{ij} = p_{ij}\lambda_i \qquad \left(\sum_{j \ne i}\lambda_{ij} = \lambda_i\right)$$

We denote by $P_i(t)$ the probability that the process will be in state $i$ at time $t$,
given it started at some initial state $i_0$ at time 0. Based on the above notations, we
can derive a set of differential equations for $P_i(t)$ ($i = 0, 1, 2, \ldots$).

For a given time instant $t$, a given state $i$, and a very small interval of time $\Delta t$,
the chain can be in state $i$ at time $t + \Delta t$ in one of the following cases:

1. It was in state $i$ at time $t$ and has not moved during the time interval $\Delta t$. This
event has a probability of $P_i(t)(1 - \lambda_i \Delta t)$ plus terms of order $\Delta t^2$.

2. It was at some other state $j$ at time $t$ ($j \ne i$) and moved from $j$ to $i$ during the
interval $\Delta t$. This event has a probability of $P_j(t)\lambda_{ji}\Delta t$ plus terms of order $\Delta t^2$.

The probability of more than one transition during $\Delta t$ is negligible (of order
$\Delta t^2$) if $\Delta t$ is small enough. Therefore, for small $\Delta t$,

$$P_i(t + \Delta t) \approx P_i(t)(1 - \lambda_i \Delta t) + \sum_{j \ne i} P_j(t)\lambda_{ji}\Delta t$$

Again, this approximation becomes more accurate as $\Delta t \to 0$, and results in

$$\frac{dP_i(t)}{dt} = -\lambda_i P_i(t) + \sum_{j \ne i}\lambda_{ji}P_j(t)$$

and, since $\lambda_i = \sum_{j \ne i}\lambda_{ij}$,

$$\frac{dP_i(t)}{dt} = -\sum_{j \ne i}\lambda_{ij}P_i(t) + \sum_{j \ne i}\lambda_{ji}P_j(t)$$

This set of differential equations (for $i = 0, 1, 2, \ldots$) can now be solved, using the
initial conditions $P_{i_0}(0) = 1$ and $P_j(0) = 0$ for $j \ne i_0$ (since $i_0$ is the initial state).

Consider, for example, a duplex system that has a single active processor and
a single standby spare that is connected only when a fault has been detected in
the active unit. Let $\lambda$ be the fixed failure rate of each of the processors (when active)
and let $c$ be the coverage factor. The corresponding Markov chain is shown in
Figure 2.15. Note that because the integers assigned to the different states are arbitrary,
we can assign them in such a way that they are meaningful and thus easier
to remember. In this example, the state represents the number of good processors
(0, 1, or 2, with the initial state being 2 good processors). The differential equations
describing this Markov chain are:

$$\begin{aligned}
\frac{dP_2(t)}{dt} &= -\lambda P_2(t) \\
\frac{dP_1(t)}{dt} &= \lambda c P_2(t) - \lambda P_1(t) \\
\frac{dP_0(t)}{dt} &= \lambda(1 - c)P_2(t) + \lambda P_1(t)
\end{aligned} \tag{2.34}$$

Solving 2.34 with the initial conditions $P_2(0) = 1$, $P_1(0) = P_0(0) = 0$ yields

$$P_2(t) = e^{-\lambda t} \qquad P_1(t) = c\lambda t e^{-\lambda t} \qquad P_0(t) = 1 - P_1(t) - P_2(t)$$

and as a result,

$$R_{system}(t) = 1 - P_0(t) = P_2(t) + P_1(t) = e^{-\lambda t} + c\lambda t e^{-\lambda t} \tag{2.35}$$

FIGURE 2.15 The Markov model for the duplex system with an inactive spare.

FIGURE 2.16 The Markov model for a duplex system with repair.


This expression can also be derived based on combinatorial arguments. The 
derivation is left to the reader as an exercise. 
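
When a closed-form solution is not readily available, the differential equations of a Markov chain can be integrated numerically. The sketch below uses a classical fourth-order Runge-Kutta scheme (our choice; any standard ODE solver would do) on a generic rate matrix, and applies it to the standby-spare duplex so that the result can be compared with Equation 2.35. All function names and the numerical values are illustrative assumptions.

```python
import math


def derivative(p, rates):
    """dP_i/dt = -sum_j rates[i][j] * P_i + sum_j rates[j][i] * P_j,
    where rates[i][j] is the transition rate from state i to state j."""
    n = len(p)
    dp = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                dp[i] += rates[j][i] * p[j] - rates[i][j] * p[i]
    return dp


def solve_markov(rates, p_init, t_end, steps=2000):
    """Integrate the state probabilities from 0 to t_end with RK4."""
    h = t_end / steps
    p = list(p_init)
    for _ in range(steps):
        k1 = derivative(p, rates)
        k2 = derivative([x + 0.5 * h * d for x, d in zip(p, k1)], rates)
        k3 = derivative([x + 0.5 * h * d for x, d in zip(p, k2)], rates)
        k4 = derivative([x + h * d for x, d in zip(p, k3)], rates)
        p = [x + h * (a + 2.0 * b + 2.0 * g + d) / 6.0
             for x, a, b, g, d in zip(p, k1, k2, k3, k4)]
    return p


def standby_duplex_rates(lam, c):
    """Rate matrix for the chain of Equation 2.34 (state = number of good processors)."""
    rates = [[0.0] * 3 for _ in range(3)]
    rates[2][1] = lam * c          # covered failure: the spare takes over
    rates[2][0] = lam * (1.0 - c)  # uncovered failure: system fails
    rates[1][0] = lam              # last processor fails
    return rates


if __name__ == "__main__":
    lam, c, t = 0.001, 0.9, 1000.0  # illustrative values
    p = solve_markov(standby_duplex_rates(lam, c), [0.0, 0.0, 1.0], t)
    closed = math.exp(-lam * t) + c * lam * t * math.exp(-lam * t)
    print("numerical R(t):", p[1] + p[2], " Equation 2.35:", closed)
```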

Our next example of a duplex system that can be analyzed using a Markov
model is a system with two active processors, each with a constant failure rate of
$\lambda$ and a constant repair rate of $\mu$. The Markov model for this system is depicted in
Figure 2.16.

As in the previous example, the state is the number of good processors. The
differential equations describing this Markov chain are

$$\begin{aligned}
\frac{dP_2(t)}{dt} &= -2\lambda P_2(t) + \mu P_1(t) \\
\frac{dP_1(t)}{dt} &= 2\lambda P_2(t) + 2\mu P_0(t) - (\lambda + \mu)P_1(t) \\
\frac{dP_0(t)}{dt} &= \lambda P_1(t) - 2\mu P_0(t)
\end{aligned} \tag{2.36}$$

Solving 2.36 with the initial conditions $P_2(0) = 1$, $P_1(0) = P_0(0) = 0$ yields

$$\begin{aligned}
P_2(t) &= \frac{\mu^2}{(\lambda+\mu)^2} + \frac{2\lambda\mu}{(\lambda+\mu)^2}e^{-(\lambda+\mu)t} + \frac{\lambda^2}{(\lambda+\mu)^2}e^{-2(\lambda+\mu)t} \\
P_1(t) &= \frac{2\lambda\mu}{(\lambda+\mu)^2} + \frac{2\lambda(\lambda-\mu)}{(\lambda+\mu)^2}e^{-(\lambda+\mu)t} - \frac{2\lambda^2}{(\lambda+\mu)^2}e^{-2(\lambda+\mu)t} \\
P_0(t) &= 1 - P_1(t) - P_2(t)
\end{aligned} \tag{2.37}$$


Note that we solve only for $P_1(t)$ and $P_2(t)$; using the boundary condition that the
probabilities must sum up to 1 (for every $t$) gives us $P_0(t)$ and reduces by one the
number of differential equations to be solved.

Note also that this system does not fail completely; it is not operational while at
state 0 but is then repaired and goes back into operation. For a system with repair,
calculating the availability is more meaningful than calculating the reliability. The
(point) availability, or the probability that the system is operational at time $t$, is

$$A(t) = P_1(t) + P_2(t)$$

The reliability $R(t)$, on the other hand, is the probability that the system never
enters state 0 at any time during $[0, t]$ and cannot be obtained out of the above
expressions. To obtain this probability, we must modify the Markov chain slightly
by removing the transition out of state 0, so that state 0 becomes an absorbing state.
This way, the probability of ever entering the state in the interval $[0, t]$ is reduced
to the probability of being in state 0 at time $t$. This probability can be found by
writing out the differential equations for this new Markov chain, solving them,
and calculating the reliability as $R(t) = 1 - P_0(t)$.
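
As a sketch of this modification, assume that the only change to the chain of Figure 2.16 is the removal of the repair transition from state 0 to state 1 (whose rate is $2\mu$ in Equation 2.36). The differential equations of the modified chain then become:

```latex
\begin{aligned}
\frac{dP_2(t)}{dt} &= -2\lambda P_2(t) + \mu P_1(t) \\
\frac{dP_1(t)}{dt} &= 2\lambda P_2(t) - (\lambda + \mu) P_1(t) \\
\frac{dP_0(t)}{dt} &= \lambda P_1(t)
\end{aligned}
\qquad\text{with } P_2(0) = 1,\; P_1(0) = P_0(0) = 0,
\qquad R(t) = 1 - P_0(t) = P_1(t) + P_2(t)
```

Since nothing leaves state 0, $P_0(t)$ is non-decreasing, as expected of the probability of having failed by time $t$.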

Since in most applications processors are repaired when they become faulty, the
long-run availability of the system, $A$, is a more relevant measure than the reliability.
To this end, we need to calculate the long-run probabilities, $P_2(\infty)$, $P_1(\infty)$, and
$P_0(\infty)$. These can be obtained either from Equation 2.37 by letting $t$ approach $\infty$
or from Equation 2.36 by setting all the derivatives $dP_i(t)/dt$ ($i = 0, 1, 2$) to 0 and using
the relationship $P_2(\infty) + P_1(\infty) + P_0(\infty) = 1$. The availability in the long run, $A$, is
then

$$A = P_2(\infty) + P_1(\infty) = \frac{\mu^2}{(\lambda+\mu)^2} + \frac{2\lambda\mu}{(\lambda+\mu)^2} = \frac{\mu(\mu+2\lambda)}{(\lambda+\mu)^2} = 1 - \left(\frac{\lambda}{\lambda+\mu}\right)^2$$
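
The state probabilities of Equation 2.37 and the long-run availability can be evaluated directly. The sketch below is illustrative; the function names and the numerical rates are our own assumptions.

```python
import math


def duplex_repair_probs(lam: float, mu: float, t: float):
    """State probabilities (P0, P1, P2) of Equation 2.37 at time t."""
    s = lam + mu
    e1 = math.exp(-s * t)
    e2 = math.exp(-2.0 * s * t)
    p2 = (mu * mu + 2.0 * lam * mu * e1 + lam * lam * e2) / (s * s)
    p1 = (2.0 * lam * mu + 2.0 * lam * (lam - mu) * e1 - 2.0 * lam * lam * e2) / (s * s)
    p0 = 1.0 - p1 - p2
    return p0, p1, p2


def long_run_availability(lam: float, mu: float) -> float:
    """A = 1 - (lam / (lam + mu))**2."""
    return 1.0 - (lam / (lam + mu)) ** 2


if __name__ == "__main__":
    lam, mu = 0.01, 0.1  # illustrative failure and repair rates
    for t in (0.0, 10.0, 100.0, 1000.0):
        p0, p1, p2 = duplex_repair_probs(lam, mu, t)
        print(t, "point availability A(t) =", p1 + p2)
    print("long-run availability A =", long_run_availability(lam, mu))
```

The point availability starts at 1 and decays toward the long-run value, with the transient governed by the exponent $(\lambda + \mu)t$.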


2.5 Fault-Tolerance Processor-Level Techniques 

All the resilient structures described so far can be applied to a wide range of modules,
from very simple combinatorial logic modules to the most complex microprocessors
or even complete processor boards. Still, duplicating complete processors
that are not used for critical applications introduces a prohibitively large overhead
and is not justified. For such cases, simpler techniques with much smaller
overheads have been developed. These techniques rely on the fact that processors
execute stored programs and upon an error, the program (or part of it) can
be re-executed as long as the following two conditions are satisfied: the error is
detected, and the cause of the error is a short-lived transient fault that will most
likely disappear before the program is re-executed.

FIGURE 2.17 Error detection using a watchdog processor.

The simplest technique of this type mandates executing every program twice 
and using the results only if the outcomes of the two executions match. This time 
redundancy approach will clearly reduce the performance of the computer by as 
much as 50%. 

The above technique does not require any means for error detection. If a mechanism
(and suitable circuitry) is provided to detect errors during the execution
of an instruction, then that instruction can be re-executed, preferably after a certain
delay to allow the transient fault to disappear. Such an instruction retry has a
considerably lower performance overhead than the brute force re-execution of the 
entire program. 

A different technique for low-cost concurrent error detection without relying 
on time redundancy is through the use of a small and simple processor that will 
monitor the behavior of the main processor. Such a monitoring processor is called
a watchdog processor and is described next.

2.5.1 Watchdog Processor 

A watchdog processor (see Figure 2.17) performs concurrent system-level error detection
by monitoring the system buses connecting the processor and the memory.
This monitoring primarily targets control flow checking, verifying that the main
processor is executing the correct blocks of code and in the right order. Such monitoring
can detect hardware faults and software faults (mistakes/bugs) that cause
either erroneous instructions to be executed or a wrong program path to be
taken.

To perform the monitoring of the control flow, the watchdog processor must 
be provided with information regarding the program(s) that are to be checked. 
This information is used to verify the correctness of the program(s) execution by 
the main processor in real-time. The information that is provided to the watchdog 
processor is derived from the Control Flow Graph (CFG), which represents the 
control flow of the program to be executed by the main processor (see an example
of a five-node CFG in Figure 2.18a). A node in this graph represents a block
of branch-free instructions; no branches are allowed from and into the block. An
edge represents a permissible flow of control, often corresponding to a branch instruction.
Labels (called signatures) are assigned to the nodes of the CFG and are
stored in the watchdog processor. During the execution of the program, run-time
signatures of the executed blocks are generated and compared with the reference
ones stored in the watchdog processor. If a discrepancy is detected, an error signal
is generated.

FIGURE 2.18 A control flow graph (a) and the corresponding watchdog check programs
for assigned signatures (b) and for calculated signatures (c). Panels: (a) A control flow
graph (CFG); (b) Checking control flow; (c) Checking nodes and control flow.

The signatures of the nodes in the CFG can be either assigned or calculated. Assigned
signatures can simply be successive integers that are stored in the watchdog
processor along with the CFG. During execution, the signatures of the currently
executed nodes are forwarded to the watchdog processor by the main
processor. The watchdog processor can then verify that the path taken by the program
corresponds to a valid path of the given CFG. The program that the watchdog
processor will execute for the CFG in Figure 2.18a is shown in Figure 2.18b,
where sig($V_i$) is the signature assigned to node $V_i$. This check program will detect
an invalid program path such as {$V_1$, $V_4$}. Note, however, that an error in one or
more instructions within a node will not be detected by this scheme.
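
The following sketch illustrates control flow checking with assigned signatures. The five-node graph used here is hypothetical (its edge set is our own assumption and is not meant to reproduce Figure 2.18a); signatures are the successive integers 1 to 5, and the checker simply verifies that each received signature is a permitted successor of the previous one.

```python
# Hypothetical CFG: node signature -> set of signatures of permitted successors.
# This edge set is an assumption chosen for illustration only.
CFG_EDGES = {
    1: {2, 3},
    2: {4},
    3: {4},
    4: {5},
    5: set(),
}


def check_control_flow(signature_stream, edges, entry):
    """Return the position of the first invalid signature in the stream
    forwarded by the main processor, or None if the path is valid."""
    expected = {entry}
    for position, sig in enumerate(signature_stream):
        if sig not in expected:
            return position          # error signal: invalid transition
        expected = edges[sig]
    return None


if __name__ == "__main__":
    print(check_control_flow([1, 2, 4, 5], CFG_EDGES, entry=1))  # valid path
    print(check_control_flow([1, 4, 5], CFG_EDGES, entry=1))     # V1 -> V4 is not an edge
```

As noted above, such a checker sees only node labels, so a corrupted instruction inside a node goes unnoticed as long as the sequence of nodes remains valid.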

To increase the error detection capabilities of the watchdog processor and allow
it to detect errors in individual instructions, calculated signatures can be used
instead of assigned ones. For a given node, a signature can be calculated from
the instructions included in the node by adding (modulo 2) all the instructions in
the node or using a checksum (see Chapter 3) or another similar code. As before,
these signatures are stored in the watchdog processor and then compared with
the run-time signatures calculated by the watchdog processor while monitoring
the instructions executed by the main processor. The program that the watchdog
processor will execute for the CFG in Figure 2.18a with calculated signatures is
shown in Figure 2.18c.
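
Adding the instructions of a node modulo 2 amounts to a bitwise XOR of the instruction words. The sketch below computes such a signature and shows a single-bit error being caught; the 32-bit instruction words are arbitrary values made up for the example, and the function name is ours.

```python
from functools import reduce
from operator import xor


def node_signature(instruction_words):
    """Calculated signature of a node: bitwise XOR (modulo-2 sum) of its
    instruction words."""
    return reduce(xor, instruction_words, 0)


if __name__ == "__main__":
    # Arbitrary 32-bit words standing in for the instructions of one node.
    block = [0x8C010004, 0x00221820, 0xAC030008]
    reference = node_signature(block)          # stored in the watchdog processor

    corrupted = list(block)
    corrupted[1] ^= 0x00000400                 # single-bit error in one instruction
    print(node_signature(corrupted) != reference)
```

A modulo-2 sum detects any single-bit error, but two errors in the same bit position of different instructions cancel each other, which is one reason a stronger checksum or code may be preferred.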

Note that most data errors will not be detected by the watchdog processor, since
the majority of such errors will not cause the program to change its execution
path. The functionality of the watchdog processor can, in principle, be extended
to cover a larger portion of data errors by including assertions in the program executed
by the watchdog processor. Assertions are reasonableness checks that verify
expected relationships among the variables of the program and, as such, are a
generalization of acceptance tests. These assertions must be prepared by the application
programmer and could be made part of the application software rather
than delegated to the watchdog processor. The performance benefits of having
the watchdog processor rather than the main processor check the assertions may
be offset by the need to frequently forward the values of the relevant application
variables from the main processor to the watchdog processor. In addition, the design
of the watchdog processor becomes more complicated since it now needs to
be capable of executing arithmetic and logical operations that would otherwise
not be required. If assertions are not used, then the watchdog processor must be
supplemented by other error-detection techniques (e.g., parity codes described in
Chapter 3) to cover data errors.

One of the quoted advantages of using a watchdog processor for error detection
is that the checking circuitry is independent of the checked circuitry, thus providing
protection against common or correlated errors. Such protection can also be
achieved in duplex structures through the use of design diversity; for example, implementing
one of the processors in complementary logic or simply using processors
from different manufacturers. Separation between the watchdog processor
and the main processor is becoming harder to achieve in current high-end microprocessors
in which simple monitoring of the processor-memory bus is insufficient
to determine which instructions will eventually be executed and which have been
fetched speculatively and will be aborted. Furthermore, the current trend to support
simultaneous multithreading greatly increases the complexity of designing
a watchdog processor. A different technique for concurrent error checking for a
processor supporting simultaneous multithreading is described next.

2.5.2 Simultaneous Multithreading for 
Fault Tolerance 

We start this section with a brief overview of simultaneous multithreading. For 
a more detailed description, the reader is invited to consult any good book on 
computer architecture. 

High-end processors today improve speed by exploiting both pipelining and
parallelism. Parallelism is facilitated by having multiple functional units, with the
attempt to overlap the execution of as many instructions as possible. However,
because of data and control dependencies, most programs have severe limits on
how much parallelism can actually be uncovered within each thread of execution.
Indeed, a study of some benchmarks found that on average only about 1.5 instructions
can be overlapped. Therefore, most of the time the majority of the functional
units will be idle. It is to remedy this problem that the approach of simultaneous
multithreading (SMT) was born.

The key idea behind SMT is the following. If data and control dependencies 
limit the amount of parallelism that can be extracted out of individual threads, 
allow the processor to execute multiple threads simultaneously. Note that we are 
not talking about rapid context switches to swap processes in and out: instructions 
from multiple threads are being executed at the same time (in the same clock cycle). 
To support such increased functionality, the architecture must be augmented
suitably. A program counter register is needed for each of the threads that the system
is simultaneously executing. If the instruction set specifies a k-register architecture
and we want to execute n threads simultaneously, at least nk physical registers are
needed (so that there is one k-register set for each of the n threads). These are just
the externally-visible registers: most high-end architectures have a larger number 
of internal registers that are not "visible" to the instruction set to facilitate register 
renaming and thereby improve performance. Unlike the nk architectural registers, 
the internal renaming registers are shared by all simultaneously executing threads, 
which also share a common issue queue. A suitable policy must be implemented 
for fetching and issuing instructions and for assigning internal registers and other 
resources so that no thread is starved. 

How is this different from just running the workload on a multiprocessor
consisting of n traditional processors? The answer lies in the way the resources can
be assigned. In the traditional multiprocessor, each processor will be running an
individual thread, which will have access to just the functional units and rename
registers associated with that processor. In the SMT, we have a set of threads that
have access to a pool of functional units and rename registers. The usage of these
entities will depend on the available parallelism within each thread at the moment;
it can change with time, as the resource requirements and inherent parallelism
levels change in each simultaneously executing thread.

To take advantage of the multithreading capability for fault-detection purposes, 
two independent threads are created for every thread that the application wants to 
run. These threads execute identical code, and care is taken to ensure that they 
receive exactly the same inputs. If all is well, they must both produce the same 
output: a divergence in output signals a fault, and appropriate steps must be taken 
for recovery. The idea is to provide almost the same amount of protection against 
transient faults as can be obtained from a traditional approach that runs a total of 
two copies of the program independently. 

To reduce the performance penalty of re-execution, the second execution of the
program always trails the first. Call these two executions the leading and the
trailing copies of the program, respectively. The advantage of doing this is that
information can be passed from the leading to the trailing copy to make the trailing
copy run faster and consume fewer computational resources. For example, the
leading copy can tell the trailing copy the outcome of conditional branches so that
the trailer never makes an incorrect branch guess, or the leading copy can make
loading faster for the trailer.

FIGURE 2.19 Sphere of replication.
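
As a rough software analogy of the leading/trailing scheme, the sketch below runs the same task twice on identical inputs and compares the two outputs. It is only an illustration under stated assumptions: the two copies run one after the other rather than simultaneously, and the names `run_redundantly`, `FaultDetected`, `checksum`, and the `inject_fault` hook are hypothetical, introduced here to simulate a transient fault in the trailing copy.

```python
from typing import Callable, Optional, Sequence


class FaultDetected(Exception):
    """Raised when the leading and trailing copies disagree."""


def run_redundantly(
    task: Callable[[Sequence[int]], int],
    inputs: Sequence[int],
    inject_fault: Optional[Callable[[int], int]] = None,
) -> int:
    # Both copies must receive exactly the same inputs, so freeze them once.
    frozen = tuple(inputs)
    leading = task(frozen)
    trailing = task(frozen)
    if inject_fault is not None:
        # Hypothetical transient fault that corrupts only the trailing copy.
        trailing = inject_fault(trailing)
    if leading != trailing:
        # A divergence in output signals a fault; recovery is left to the caller.
        raise FaultDetected(f"leading={leading}, trailing={trailing}")
    return leading


def checksum(values: Sequence[int]) -> int:
    """Example task: a small checksum over the inputs."""
    return sum(v * v for v in values) % 997
```

Calling `run_redundantly(checksum, [1, 2, 3])` returns the common result, whereas passing `inject_fault=lambda x: x ^ 1` makes the two copies diverge and raises `FaultDetected`.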

To support the two independent but identical threads, two different sets of
several hardware components must be assigned to these threads. For example, two
sets of the architectural registers must be used so that a fault in a register being 
used by one thread will have no impact on the execution of the other thread. 

This leads to the concept of the sphere of replication. Items that are replicated for 
the two threads are said to be within this sphere; items that are not replicated are 
outside. Data flows across the surface of this sphere (see Figure 2.19). Items that 
are replicated use such redundancy as a means for fault tolerance and are within 
the sphere of replication; items that are not must use some other means (such as 
error-correcting codes) to protect against the impact of faults. We can decide what 
items fall within the sphere of replication based on the cost or overhead that they 
entail and the effectiveness with which other fault-tolerance techniques can protect 
them should they be kept outside it. For example, providing two copies of the 
instruction and data caches may be too expensive, and so, one can rely instead on 
error-correcting codes to protect their contents. 


2.6 Byzantine Failures

We have so far classified failures according to their temporal behavior: are they 
permanent or do they go away after some time? We will now introduce another 
important classification, based on how the failed unit behaves. 

It is usually assumed that when a unit fails, it goes dead. The picture many
people have in their minds is that of a lightbulb, which fails by burning out. If all
devices behaved that way when they failed, dealing with failures would be
relatively simple. However, devices in general, and processors in particular, can suffer
malicious failures in which they produce arbitrary outputs. Such failures are known
as Byzantine failures, and are described below. These failures cause no problem in
an M-of-N system with voting since the voter acts as a centralizing entity, masking
out the erroneous outputs. However, when processors are used in a truly
distributed way without such a centralizing entity, Byzantine failures can cause subtle
problems.

FIGURE 2.20 Network for Byzantine example.

To see this, consider the following example. A sensor is providing temperature
information to two processors through point-to-point links between them (see
Figure 2.20). The sensor has suffered a Byzantine failure and tells processor P1 that the
temperature is 25° while telling P2 that it is 45°. Now, is there any way in which P1
and P2 can figure out that the sensor is faulty? The best they can do is to exchange
the messages they have received from the sensor: P1 tells P2 that it got 25°, and P2
tells P1 that it got 45°. At this point, both processors know that something is wrong
in the system, but neither can figure out which unit is malfunctioning. As far as P1
is concerned, the input it received from the sensor contradicts the input from P2;
however, it has no way of knowing whether it is the sensor or P2 that is faulty. P2
has a similar problem. No number of additional communications between P1 and
P2 can solve this problem.
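
The ambiguity can be made concrete with a small sketch. It assumes a hypothetical helper, `p1_view`, that returns everything P1 observes: the reading it got from the sensor and the reading P2 claims to have got. A faulty sensor and a faulty P2 then produce exactly the same view at P1.

```python
def p1_view(sensor_values, p2_is_faulty=False, p2_lie=None):
    """Return (value P1 got from the sensor, value P2 reports to P1).

    sensor_values: (value sent to P1, value sent to P2).
    p2_lie: the value a faulty P2 reports instead of what it received.
    """
    to_p1, to_p2 = sensor_values
    reported_by_p2 = p2_lie if p2_is_faulty else to_p2
    return (to_p1, reported_by_p2)


# Scenario A: the sensor is Byzantine (25 to P1, 45 to P2); P2 relays honestly.
faulty_sensor = p1_view((25, 45))

# Scenario B: the sensor is healthy (25 to both); P2 is Byzantine and claims 45.
faulty_p2 = p1_view((25, 25), p2_is_faulty=True, p2_lie=45)

# P1 sees the same pair in both scenarios, so it cannot tell which unit failed.
indistinguishable = faulty_sensor == faulty_p2
```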

This is known as the Byzantine Generals problem, since an early paper in this 
field used as a model a general communicating his attack plans to his lieutenants 
by messengers. A traitorous commander could send contradictory messages to his 
lieutenants, or one or more of the lieutenants could be disloyal and misrepresent 
the commander's orders and get some divisions to attack and others to retreat. 
The objective is to get all the loyal lieutenants to agree on the commander's order. 
If the commanding general is loyal, the order the loyal lieutenants agree on must 
be the order that the commander sent. Traitorous officers can lie about the order 
they received. 

The solution to this problem is the Byzantine Generals algorithm (also known
as the Interactive Consistency algorithm). The model is that of a single entity (such
as a sensor or processor) disseminating the value of some variable to a set of
receivers. The receivers can communicate among themselves to exchange
information about the value they received from the original source. If a unit is functional,
it will be truthful in all its messages; a faulty unit may behave arbitrarily. This
arbitrary behavior includes the possibility of sending out contradictory messages. All
communications are time-bounded, i.e., the absence of a message can be detected
by a time-out mechanism. The goal of the algorithm is to satisfy the following
interactive consistency conditions:

IC1. All functional (non-faulty) units must agree on the value
that was transmitted by the original source.

IC2. If the original source is functional, the value they agree on must be the value 
that was sent out by the original source. 

There are many algorithms to solve the Byzantine Generals problem. We will 
present here the original algorithm, because it is the simplest. More recent
algorithms are referenced in the Further Reading section.

The algorithm is recursive. Let there be N units in all (one original source and
N − 1 receivers), of which up to m may be faulty. It is possible to show that
interactive consistency can only be obtained when N ≥ 3m + 1. If N ≤ 3m, no algorithm
can be constructed that satisfies the interactive consistency conditions.

The algorithm Byz(N,m) consists of the following three steps: 

Step 1. The original source disseminates the information to each of the N − 1
receivers.

Step 2. If m > 0, each of the N − 1 receivers now acts as an original source to
disseminate the value that it received in the previous step. To do this, each receiver
runs the Byz(N − 1, m − 1) algorithm, and sends out its received value to the
other N − 2 receivers. If a unit does not get a message from another unit, it
assumes the default message was sent and so enters the default into its records. If
m = 0, this step is bypassed.

Step 3. At the end of the preceding step, each receiver has a vector, containing
the agreed values received (a) from the original source, and (b) from each of the
other receivers (if m > 0). If m > 0, each receiver takes a vote over the values
contained in its vector, and this is used as the value that was transmitted by
the original source. If no majority exists, a default value is used. If m = 0, the
receiver simply uses the value it received from the original source.

Note that we assume that all units have a timer available to them and a timeout 
mechanism to detect the absence (or loss) of a message. Otherwise, a faulty node 
could cause the entire system to be suspended indefinitely by remaining silent. 
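
The three steps above can be sketched as a short recursive simulation. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names `byz` and `majority`, the `faulty` map of lying functions, and the default value 0 are all introduced here for illustration, and a missing message is modeled as `None` and replaced by the default (standing in for the timeout).

```python
from collections import Counter

DEFAULT = 0  # assumed default value; the text leaves the choice open


def majority(values, default=DEFAULT):
    """Return the strict-majority value of `values`, or `default` if none exists."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else default


def byz(source, receivers, value, m, faulty, path=()):
    """Simulate Byz(N, m) with N = len(receivers) + 1.

    `faulty` maps a faulty unit to a function lie(path, receiver, value) that
    returns what the unit sends to `receiver` (None models a missing message).
    Returns {receiver: value that receiver attributes to `source`}.
    """
    path = path + (source,)

    # Step 1: the source sends its value to every receiver.
    received = {}
    for r in receivers:
        sent = faulty[source](path, r, value) if source in faulty else value
        received[r] = DEFAULT if sent is None else sent  # timeout -> default

    if m == 0:  # Step 2 is bypassed
        return received

    # Step 2: each receiver j relays what it got by running Byz(N-1, m-1).
    reports = {r: {} for r in receivers}  # reports[r][j]: j's value as seen by r
    for j in receivers:
        others = [r for r in receivers if r != j]
        for r, v in byz(j, others, received[j], m - 1, faulty, path).items():
            reports[r][j] = v

    # Step 3: each receiver votes over its own value and the relayed ones.
    return {
        r: majority([received[r]] + [reports[r][j] for j in receivers if j != r])
        for r in receivers
    }


# A faulty source S sends 1, 1, 0 to three functional receivers (N = 4, m = 1).
lies = {"R1": 1, "R2": 1, "R3": 0}
outcome = byz("S", ["R1", "R2", "R3"], 1, 1, {"S": lambda p, r, v: lies[r]})
print(outcome)  # {'R1': 1, 'R2': 1, 'R3': 1}
```

The entries computed for faulty receivers are meaningless and should be ignored; only the values at functional receivers are required to satisfy IC1 and IC2.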

Let us consider some examples of this algorithm. We will use the following 
notations: 


■ If A and B are units, then A.B(n) means that A sent B the message n.

■ If U is a string of units A1, A2, ..., Am, and B is a unit, then U.B(n) means
that B received the message n from Am who claims to have received it from
A(m−1) and so on.

■ A message that is not sent is denoted by φ. For example, A.B(φ) means that
the message that A was supposed to send B was never sent.

For example, A.B.C(n) represents the fact that B told C that the value it received
from A was n. Similarly, A.B.C.D(n) means that D received the message n from
C who claims to have received it from B who, in turn, claims to have received it
from A. The string of units thus represents a chain along which the given message,
n, has passed. For example, Black.White.Green(341) means that Green received the
message 341 from White who claims to have received it from Black.


■ EXAMPLE 

Consider the degenerate case of the algorithm when m = 0, i.e., no fault
tolerance is provided. In such a case, step 2 is bypassed, and the interactive
consistency vector consists of a single value: the one that has been received from the
original source. ■


■ EXAMPLE

Consider now the case where m = 1. We must have at least 3m + 1 = 4 units
participating in this algorithm. Our model in this example consists of a
sensor, S, and three receivers, R1, R2, and R3. Suppose the sensor is faulty and
sends out inconsistent messages to the receivers: S.R1(1), S.R2(1), S.R3(0). All
the receivers are functional, and the default is assumed to be 1.

In the second step of the algorithm, R1, R2, and R3 each acts as the source for
the message it received from the sensor and runs Byz(3, 0) on it. That is, the
following messages are sent:

S.R1.R2(1)   S.R1.R3(1)

S.R2.R1(1)   S.R2.R3(1)

S.R3.R1(0)   S.R3.R2(0)


Define an Interactive Consistency Vector (ICV) at receiver R_i as
$(x^i_1, x^i_2, \ldots, x^i_{N-1})$, where

$$
x^i_j =
\begin{cases}
\text{Report of } R_j \text{ as determined by } R_i & \text{if } i \neq j \\
\text{Value received from the original source} & \text{if } i = j
\end{cases}
$$

At the end of this step, the ICVs are each (1, 1, 0) at every receiver. Taking the
majority vote over this yields 1, which is the value used by each of them. ■
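
The ICVs in this example can be rebuilt mechanically from the messages listed. The sketch below is illustrative only: the dictionaries `direct` and `relayed` and the helpers `icv` and `vote` are names assumed here, with `relayed[(j, i)]` holding the value of the message S.Rj.Ri.

```python
from collections import Counter

# Step 1 messages S.Ri(v): what each receiver got directly from the sensor.
direct = {"R1": 1, "R2": 1, "R3": 0}

# Step 2 messages S.Rj.Ri(v), keyed by (relaying unit j, receiving unit i).
relayed = {
    ("R1", "R2"): 1, ("R1", "R3"): 1,
    ("R2", "R1"): 1, ("R2", "R3"): 1,
    ("R3", "R1"): 0, ("R3", "R2"): 0,
}


def icv(i):
    """Entry j is R_i's own value if i == j, else R_j's report as seen by R_i."""
    return tuple(direct[j] if j == i else relayed[(j, i)] for j in sorted(direct))


def vote(vector, default=1):
    """Strict majority over the vector; the example's default is 1."""
    value, count = Counter(vector).most_common(1)[0]
    return value if 2 * count > len(vector) else default


icvs = {i: icv(i) for i in sorted(direct)}
decisions = {i: vote(v) for i, v in icvs.items()}
```

Every receiver ends up with the vector (1, 1, 0) and therefore the same decision, even though the sensor itself was inconsistent.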
■ EXAMPLE

Let N = 7, m = 2, but this time let receivers R1 and R6 be faulty and the other
units (S, R2, R3, R4, R5) be functional. The messages sent out in the first round
by S are consistent: S.R1(1), S.R2(1), S.R3(1), S.R4(1), S.R5(1), and S.R6(1). Each
of the receivers now executes Byz(6, 1) in step 2 of the Byz(7, 2) algorithm.
Consider R1 first. This unit is faulty and can send out any message it likes (or
even nothing at all). Suppose it sends out the following messages in step 1 of
the Byz(6, 1) algorithm for all receivers to agree on its value:

S.R1.R2(1)   S.R1.R3(2)   S.R1.R4(3)   S.R1.R5(4)   S.R1.R6(0)


In step 2 of this Byz(6, 1) algorithm, each of the remaining receivers (R2, R3, R4,
R5, R6) uses the Byz(5, 0) algorithm to disseminate the message it received
from R1. The following are the messages:


S.R1.R2.R3(1)   S.R1.R2.R4(1)   S.R1.R2.R5(1)   S.R1.R2.R6(1)
S.R1.R3.R2(2)   S.R1.R3.R4(2)   S.R1.R3.R5(2)   S.R1.R3.R6(2)
S.R1.R4.R2(3)   S.R1.R4.R3(3)   S.R1.R4.R5(3)   S.R1.R4.R6(3)
S.R1.R5.R2(4)   S.R1.R5.R3(4)   S.R1.R5.R4(4)   S.R1.R5.R6(4)
S.R1.R6.R2(1)   S.R1.R6.R3(8)   S.R1.R6.R4(0)   S.R1.R6.R5(φ)


Note that R6, being maliciously faulty, is free to send out anything it likes.

The ICVs maintained at each of the receivers in connection with the S.R1(1)
message are:

ICV_S.R1(R2) = (1, 2, 3, 4, 1)

ICV_S.R1(R3) = (1, 2, 3, 4, 8)

ICV_S.R1(R4) = (1, 2, 3, 4, 0)

ICV_S.R1(R5) = (1, 2, 3, 4, 0)

ICV_S.R1(R6) is irrelevant, since R6 is faulty. Also, note that since R5 received
nothing from R6, its value is recorded as the default, say 0.

When R2, R3, R4, and R5 examine their ICVs, they find no majority and
therefore assume the default value for S.R1. This default is zero, and so each of
these receivers records that the message that S sent R1 is agreed to be 0.
Similarly, agreement can be reached on the message that S sent to each of the
other receivers (the reader is encouraged to write out the messages). This
completes the generation of the ICVs connected with the original Byz(7, 2)
algorithm. ■
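
The vote in this example can be checked with a few lines of code. The sketch assumes the default value 0 used in the example and represents the unsent message φ as `None`; the names `raw`, `to_icv`, and `vote` are hypothetical.

```python
from collections import Counter

PHI = None   # stands for a message that never arrived
DEFAULT = 0  # default assumed in the example

# Values held by each functional receiver about what R1 got from S,
# in the order (R2, R3, R4, R5, R6); R5 received nothing from R6.
raw = {
    "R2": (1, 2, 3, 4, 1),
    "R3": (1, 2, 3, 4, 8),
    "R4": (1, 2, 3, 4, 0),
    "R5": (1, 2, 3, 4, PHI),
}


def to_icv(vector, default=DEFAULT):
    """Record every missing message as the default value."""
    return tuple(default if v is PHI else v for v in vector)


def vote(icv, default=DEFAULT):
    """Strict majority over the ICV, or the default if there is none."""
    value, count = Counter(icv).most_common(1)[0]
    return value if 2 * count > len(icv) else default


agreed = {r: vote(to_icv(v)) for r, v in raw.items()}
```

No value appears in more than half of the five positions of any vector, so all four functional receivers fall back to the same default and agree that S.R1 is 0.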


Let us now prove that algorithm Byz does indeed satisfy the Interactive
Consistency conditions, IC1 and IC2, if N ≥ 3m + 1. We proceed by induction on m. The
induction hypothesis is that the theorem holds for all m ≤ M for some M ≥ 0. We
now consider two cases.

Case 1. The original source is non-faulty. 

We show by induction that whenever the original source is nonfaulty, algorithm
Byz(N, m) satisfies IC2 if there are more than 2k + m nodes and at most k faulty
elements. The proof is by induction on m. Assume the result holds for all m ≤ M
and consider the case m = M + 1.

In the first step, the original source sends out its message to each of the other
processors. Since the source is nonfaulty, all processors receive consistent
messages.

In the second step, each processor runs Byz(N − 1, m − 1) to disseminate the
message it received from the original source. Since N > 2k + m, we have N − 1 >
2k + m − 1. Hence, by the induction hypothesis, executing Byz(N − 1, m − 1) is
sufficient to permit all correct processors to disseminate the messages they received.

Now, set k = m. Since there are at most m faulty elements, a majority of the
processors is functional. Hence, the majority vote on the values disseminated will
result in a consistent value being produced by each correct processor.

Case 2. The original source is faulty. 

If the original source is faulty, at most m — 1 other processors can be faulty. 

In step 1, the original source can send out any message it likes to each of the
other processors. There are N − 1 ≥ 3(m − 1) + 1 other processors. Hence, when
these processors run Byz(N − 1, m − 1) among the N − 1 other processors, by the
induction hypothesis, each processor will have consistent entries in its ICV for 
each of them. The only entry in the ICV that can differ is that corresponding to the 
original source. Therefore, when the majority function is applied to each ICV, the 
result is the same, and the proof is completed. 

We have shown that N ≥ 3m + 1 is a sufficient condition for Byzantine
agreement. We did this by construction, i.e., by presenting an algorithm that achieved
consistency under these conditions. It also turns out that this condition is
necessary. That is, under the condition of two-party messages and arbitrary failures, it
is impossible for any algorithm to guarantee that conditions IC1 and IC2 will be
met if N ≤ 3m.

2.6.1 Byzantine Agreement with 
Message Authentication 

The Byzantine Generals problem is hard because faulty processors could lie about 
the message they received. Let us now remove this possibility by providing some 
mechanism to authenticate the messages. That is, suppose each processor can
append to its messages an unforgeable signature. Before forwarding a message, a
processor appends its own signature to the message it received. The recipient can
check the authenticity of each signature. Thus, if a processor receives a message
that has been forwarded through processors A and B, it can check to see whether
the signatures of A and B have been appended to the message and if they are valid.
Once again, we assume that all processors have timers so that they can time out
any (faulty) processor that remains silent.

In such a case, maintaining interactive consistency becomes very easy. Here is 
an algorithm that does so: 

Algorithm. AByz(N,m) 

Step A1. The original source signs its message ψ and sends it out to each of the
processors.

Step A2. Each processor i that receives a signed message ψ : A, where A is the set
of signatures appended to the message ψ, checks the number of signatures in A.
If this number is less than m + 1, it sends out ψ : A ∪ {i} (i.e., what it received
plus its own signature) to each of the processors not in set A. It also adds this
message, ψ, to its list of received messages.

Step A3. When a processor has seen the signatures of every other processor (or
has timed out), it applies some decision function to select from among the
messages it has received.
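
Steps A1 to A3 can be sketched as a small simulation. Several assumptions are made for illustration and are not part of the algorithm as stated: signatures are modeled as a tuple of signer identities that the simulation itself keeps honest (standing in for unforgeability), faulty relays are modeled only as processors that stay silent, and `abyz`, `decide`, and the default value 0 are hypothetical names and choices.

```python
from collections import deque


def abyz(n, m, source_sends, silent=frozenset()):
    """Simulate AByz(N, m) for processors 0..n-1, where 0 is the original source.

    source_sends: {receiver: value} signed by the source in step A1 (a faulty
        source may sign different values for different receivers).
    silent: faulty relays, modeled here only as processors that never forward.
    Returns {processor: set of values on its list of received messages}.
    """
    received = {p: set() for p in range(1, n)}
    # Each queue entry is (value, signers, destination); signers plays the role of A.
    queue = deque((value, (0,), dest) for dest, value in source_sends.items())
    while queue:
        value, signers, dest = queue.popleft()
        received[dest].add(value)  # step A2: record the message
        if dest in silent or len(signers) >= m + 1:
            continue  # nothing is forwarded
        forwarded = signers + (dest,)  # append own signature
        for p in range(1, n):
            if p not in forwarded:
                queue.append((value, forwarded, p))
    return received


def decide(values, default=0):
    """One possible decision function: the unique value, else a default."""
    return next(iter(values)) if len(values) == 1 else default


# A faulty source signs 25 for processors 1 and 3 but 45 for processor 2.
lists = abyz(4, 1, {1: 25, 2: 45, 3: 25})
```

Here every processor ends up with the same list of received values, so any deterministic decision function applied to that list yields the same choice everywhere.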

Let us now show that the algorithm maintains Byzantine agreement for any
number of processors. Clearly, if N ≤ m + 2, the problem becomes trivial.

As before, we consider two cases. 

Case 1. The original source is functional. 

In such a case, an identical signed message (say, μ) is transmitted by the
original source to every processor in the system. Since nobody can forge the original
source's signature, no processor will accept any message other than μ in step A2
(any corruption of a message will, by definition, be detected). As a result, each
functional processor will correctly select μ as the message disseminated by the
original source.

Case 2 . The original source is faulty. 

In this case, different messages may be sent out to different processors, each 
with the original source's correct signature. We now show that the list of received 
messages (minus the signatures) is the same at each nonfaulty processor. 

Let us proceed by contradiction. Suppose this is not true, and in particular, the
sets at nonfaulty processors i and j (call them Ψ_i and Ψ_j) are different. Let ψ1 be a
message in Ψ_i but not in Ψ_j.

Since processor i did not pass on ψ1 to processor j, ψ1 must have had at least
m + 1 signatures appended to it. Since at most m processors are faulty, at least one
of these signatures belongs to a nonfaulty processor; let ℓ be such a processor. When
processor ℓ received ψ1, j's signature was not appended to ψ1, and the list of signatures
would have been less than m + 1 long. Hence, processor ℓ would have forwarded
the message to j, and so ψ1 ∈ Ψ_j, establishing the desired contradiction.
2.7 Further Reading

An excellent introduction to the basics of hardware fault tolerance can be found in
[24]. Some basic definitions can be found in [2]. Hardware failure rate models are
described in [27]. The topic of hardware/logic circuit testing is covered in many
textbooks, e.g., [1,8].

Readers who are weak in probability theory may have found some of the
reliability derivations difficult to understand. A very readable source for the
mathematical background associated with such probabilistic calculations is [26]. The
textbook [6] is quite dated, but is still very useful as a detailed and advanced
introduction to reliability models. [10] contains a description of reliability models in
addition to a guide to statistical methods.

One approach to representing the dependence of overall system reliability on 
the health of individual modules is fault trees. For details, see [5,29]. 

Voting techniques have been the focus of some work in the literature: a good
comprehensive reference is [14] with more recent work reported in [3,7,19].
Compensating faults in NMR structures were introduced in [23] and an analysis of
hybrid redundancy with compensating faults appears in [12]. The sift-out modular
redundancy is described in [25].

Various techniques for processor error checking by watchdog processors have
been described in the literature. An excellent survey with an extensive list of
references appears in [16]. The capabilities of watchdog processors were extended to
include checking of memory accesses in [18]. Other signature-generation schemes
for checking the program control flow, based on the use of M-of-N codes (see
Chapter 3), have been described in [28]. The exploitation of multithreading techniques
for fault tolerance is discussed in [17,22,30].

There is an extensive bibliography on Byzantine Generals algorithms. See, for 
example, [9,11,13,15,20]. A good survey can be found in [4]. 

2.8 Exercises 

1. The lifetime (measured in years) of a processor is exponentially distributed,
with a mean lifetime of 2 years. You are told that a processor failed
sometime in the interval [4,8] years. Given this information, what is the conditional
probability that it failed before it was 5 years old?

2. The lifetime of a processor (measured in years) follows the Weibull
distribution, with parameters λ = 0.5 and β = 0.6.

a. What is the probability that it will fail in its first year of operation?

b. Suppose it is still functional after t = 6 years of operation. What is the
conditional probability that it will fail in the next year?

c. Repeat parts a and b for β = 2.

d. Repeat parts a and b for β = 1.

FIGURE 2.21 A 5-module series-parallel system.

FIGURE 2.22 A 7-module series-parallel system.

3. To get a feel for the failure rates associated with the Weibull distribution, plot
them for the following parameter values as a function of the time, t:

a. Fix λ = 1 and plot the failure rate curves for β = 0.5, 1.0, 1.5.

b. Fix β = 1.5 and plot the failure rate curves for λ = 1, 2, 5.

4. Write the expression for the reliability R_system(t) of the series/parallel system
shown in Figure 2.21, assuming that each of the five modules has a reliability
of R(t).

5. The lifetime of each of the seven blocks in Figure 2.22 is exponentially
distributed with parameter λ. Derive an expression for the reliability function of the
system, R_system(t), and plot it over the range t = [0, 100] for λ = 0.02.

6. Consider a triplex that produces a 1-bit output. Failures that cause the output
of a processor to be permanently stuck at 0 or stuck at 1 occur at constant rates
λ0 and λ1, respectively. The voter never fails. At time t, you carry out a
calculation the correct output of which should be 0. What is the probability that
the triplex will produce an incorrect result? (Assume that stuck-at faults are
the only ones that a processor can suffer from, and that these are permanent
faults; once a processor has its output stuck at some logic value, it remains
stuck at that value forever.)

7. Write the expression for the reliability of a 5MR system and calculate its MTTF.
Assume that failures occur as a Poisson process with rate λ per node, that
failures are independent and permanent, and that the voter is failure-free.

8. Consider an NMR system that produces an eight-bit output. N = 2m + 1 for
some m. Each processor fails at a constant rate λ and the failures are
permanent. A failed processor produces any of the 2^8 possible outputs with equal
probability. A majority voter is used to produce the overall output, and the
voter is assumed never to fail. What is the probability that, at time t, a
majority of the processors produce the same incorrect output after executing some
program?

9. Design a majority voter circuit out of two- and three-input logic gates. Assume
that you are voting on 1-bit inputs.

10. Derive an expression for the reliability of the voter you designed in the
previous question. Assume that, for a given time t, the output of each gate is
stuck-at-0 or stuck-at-1 with probability P0 and P1, respectively (and is
fault-free with probability 1 − P0 − P1). What is the probability that the output of
your voter circuit is stuck-at-0 (stuck-at-1) given that the three inputs to the
voter are fault-free and do change between 000 and 111?

11. Show that the MTTF of a parallel system of N modules, each of which suffers
permanent failures at a rate λ, is $\mathrm{MTTF}_p = \sum_{k=1}^{N} \frac{1}{k\lambda}$.

12. Consider a system consisting of two subsystems in series. For improved
reliability, you can build subsystem i as a parallel system with k_i units, for i = 1, 2.
Suppose permanent failures occur at a constant rate λ per unit.

a. Derive an expression for the reliability of this system.

b. Obtain an expression for the MTTF of this system with k_1 = 2 and k_2 = 3.

13. List the conditions under which the processor/memory TMR configuration
shown in Figure 2.9 will fail, and compare them to a straightforward TMR
configuration with three units, in which each unit consists of a processor and
a memory. Denote by R_p, R_m, and R_v the reliability of a processor, a memory,
and a voter, respectively, and write expressions for the reliability of the two
TMR configurations.

14. Write expressions for the upper and lower bounds and the exact reliability
of the non-series/parallel system shown in Figure 2.23 (denote by
R_i(t) the reliability of module i). Assume that D is a bidirectional unit.

FIGURE 2.23 A 6-module non-series/parallel system.

FIGURE 2.24 A TMR with a spare.

15. The system shown in Figure 2.24 consists of a TMR core with a single spare
a that can serve as a spare only for module 1. Assume that modules 1 and a
are active. When either of the two modules 1 or a fails, the failure is detected
by the perfect comparator C, and the single operational module is used to
provide an input to the voter.

a. Assuming that the voter is perfect as well, which one of the following
expressions for the system reliability is correct (where each module has a
reliability R and the modules are independent)?

1. R_system = R^4 + 4R^3(1 − R) + 3R^2(1 − R)^2

2. R_system = R^4 + 4R^3(1 − R) + 4R^2(1 − R)^2

3. R_system = R^4 + 4R^3(1 − R) + 5R^2(1 − R)^2

4. R_system = R^4 + 4R^3(1 − R) + 6R^2(1 − R)^2

b. Write an expression for the reliability of the system if instead of a
perfect comparator for modules 1 and a, there is a coverage factor c (c is the
probability that a failure in one module is detected, the faulty module is
correctly identified, and the operational module is successfully connected
to the voter that is still perfect).

16. A duplex system consists of two active units and a comparator. Assume that
each unit has a failure rate of λ and a repair rate of μ. The outputs of the two
active units are compared, and when a mismatch is detected, a procedure to
locate the faulty unit is performed. The probability that upon a failure, the
faulty unit is correctly identified and the fault-free unit (and consequently, the
system) continues to run properly is the coverage factor c. Note that when
a coverage failure occurs, the entire system fails and both units have to be
repaired (at a rate μ each). When the repair of one unit is complete, the system
becomes operational and the repair of the second unit continues, allowing the
system to return to its original state.

a. Show the Markov model for this duplex system. 

b. Derive an expression for the long-term availability of the system assuming
that μ = 2λ.

17. a. Your manager in the Reliability and Quality Department asked you to
verify her calculation of the reliability of a certain system. The equation that
she derived is

R_system = R_C [1 − (1 − R_A)(1 − R_B)] [1 − (1 − R_D)(1 − R_E)]
           + (1 − R_C) [1 − (1 − R_A R_D)(1 − R_B R_E)]

However, she lost the system diagram. Can you draw the diagram based
on the expression above?

b. Write expressions for the upper and lower bounds on the reliability of
the system and calculate these values and the exact reliability for the case
R_A = R_B = R_C = R_D = R_E = R = 0.9.

18. A duplex system consists of a switching circuit and two computing units: an
active unit with a failure rate of λ1 and a standby idle unit that has a lower
failure rate λ2 < λ1 while idle. The switching circuit frequently tests the active
unit, and when a fault is detected, the faulty unit is switched out, and the
second unit is switched in and becomes fully operational with a failure rate
λ1. The probability that upon a failure, the fault is correctly detected and the
fault-free idle unit resumes the computation successfully is denoted by c (the
coverage factor). Note that when a coverage failure occurs, the entire system
fails.

a. Show the Markov model for this duplex system (hint: three states are
sufficient).

b. Write the differential equations for the Markov model and derive an
expression for the reliability of the system.

19. You have a processor susceptible only to transient failures, which occur at a 
rate of $\lambda$ per second. The lifetime of a transient fault (measured in seconds) is 
exponentially distributed with parameter $\mu$. Your fault-tolerance mechanism 
consists of running each task twice on this processor, with the second 
execution starting r seconds after the first. The executions take s seconds each 
(r > s). Find the probability that the output of the first execution is correct, 
but that of the second execution is incorrect. 

3 Information Redundancy 


Errors in data may occur when the data are being transferred from one unit to 
another, from one system to another, or even while the data are stored in a 
memory unit. To tolerate such errors, we introduce redundancy into the data: this is 
called information redundancy. The most common form of information redundancy 
is coding, which adds check bits to the data, allowing us to verify the correctness 
of the data before using them and, in some cases, even allowing the correction of the 
erroneous data bits. Several commonly used error-detecting and error-correcting 
codes are discussed in Section 3.1. 

Introducing information redundancy through coding is not limited to the level 
of individual data words but can be extended to provide fault tolerance for larger 
data structures. The best-known example of such a use is the Redundant Array of 
Independent Disks (RAID) storage system. Various RAID organizations are 
presented in Section 3.2, and the resulting improvements in reliability and availability 
are analyzed. 

In a distributed system where the same data sets may be needed by different 
nodes in the system, data replication may help with data accessibility. Keeping 
a copy of the data on just a single node could cause this node to become a 
performance bottleneck and leave the data vulnerable to the failure of that node. An 
alternative approach would be to keep identical copies of the data on multiple 
nodes. Several schemes for managing the replicated copies of the same data are 
presented in Section 3.3. 

We conclude this chapter with a description of algorithm-based fault tolerance, 
which can be an efficient information redundancy technique for applications that 
process large arrays of data elements. 

3.1 Coding 

Coding is an established area of research and practice, especially in the 
communication field, and many textbooks on this topic are available (see the Further 
Reading section). Here, we limit ourselves to a brief survey of the more common codes. 

When coding, a d-bit data word is encoded into a c-bit codeword, which 
consists of a larger number of bits than the original data word, i.e., c > d. This 
encoding introduces information redundancy, that is, we use more bits than absolutely 
needed. A consequence of this information redundancy is that not all $2^c$ binary 
combinations of the c bits are valid codewords. As a result, when attempting to 
decode the c-bit word to extract the original d data bits, we may encounter an invalid 
codeword and this will indicate that an error has occurred. For certain encoding 
schemes, some types of errors can even be corrected and not just detected. 

A code is defined as the set of all permissible codewords. Key performance 
parameters of a code are the number of erroneous bits that can be detected as 
erroneous, and the number of errors that can be corrected. The overhead imposed 
by the code is measured in terms of both the additional bits that are required and 
the time needed to encode and decode. 

An important metric of the space of codewords is the Hamming distance. The 
Hamming distance between two codewords is the number of bit positions in 
which the two words differ. Figure 3.1 shows the eight 3-bit binary words. Two 
words in this figure are connected by an edge if their Hamming distance is 1. The 
words 101 and 011 differ in two bit positions and have, therefore, a Hamming 
distance of 2; one has to traverse two edges in Figure 3.1 to get from node 101 to node 
011. Suppose two valid codewords differ in only the least significant bit position, 
for example, 101 and 100. In this case, a single error in the least significant bit in 
either one of these two codewords will go undetected, since the erroneous word 
is also an existing codeword. In contrast, a Hamming distance of two (or more) 
between two codewords guarantees that a single-bit error in either of the two words 
will not change it into the other. 

FIGURE 3.1 Hamming distances in the 3-bit word space. 

The code distance is the minimum Hamming distance between any two valid 
codewords. The code that consists of the four codewords {001, 010, 100, 111}, 
marked by circles in Figure 3.1, has a distance of 2 and is, therefore, capable of 
detecting any single-bit error. The code that consists only of the codewords {000, 111} 
has a distance of 3 and is, therefore, capable of detecting any single- or double-bit 
error. If double-bit errors are not likely to happen, this code can be used to correct 
any single-bit error. In general, to detect up to k-bit errors, the code distance must 
be at least k + 1, whereas to correct up to k-bit errors, the code distance must be at 
least 2k + 1. The code {000, 111} can be used to encode a single data bit with 0 (for 
example) encoded as 000 and 1 as 111. This code is similar to the TMR redundancy 
technique, which was discussed in Chapter 2. In principle, many redundancy 
techniques can be considered as coding schemes. A duplex, for example, can be 
considered as a code whose valid codewords consist of two identical data words. For 
a single data bit, the codewords will be 00 and 11. 
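
The distance definitions above can be checked mechanically. The following is a minimal Python sketch, not part of the original text; the function names `hamming_distance`, `code_distance`, and `capabilities` are chosen here purely for illustration, and codewords are assumed to be given as equal-length bit strings.

```python
from itertools import combinations

def hamming_distance(u: str, v: str) -> int:
    """Number of bit positions in which two equal-length words differ."""
    if len(u) != len(v):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(u, v))

def code_distance(code) -> int:
    """Minimum Hamming distance over all pairs of distinct codewords."""
    return min(hamming_distance(u, v) for u, v in combinations(code, 2))

def capabilities(distance: int):
    """(max detectable errors, max correctable errors) for a given code distance.

    Follows the rule in the text: detecting k errors needs distance k + 1,
    correcting k errors needs distance 2k + 1.
    """
    return distance - 1, (distance - 1) // 2

print(hamming_distance("101", "011"))                # 2
print(code_distance(["001", "010", "100", "111"]))   # 2
print(code_distance(["000", "111"]))                 # 3
print(capabilities(3))                               # (2, 1)
```

The pairwise search is quadratic in the number of codewords, which is adequate for the small codes discussed here.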

Another important property of codes is separability. A separable code has 
separate fields for the data and the check bits. Therefore, decoding for a separable 
code simply consists of selecting the data bits and disregarding the check bits. The 
check bits must still be processed separately to verify the correctness of the data. 
A nonseparable code, on the other hand, has the data and check bits integrated 
together, and extracting the data from the encoded word requires some processing, 
thus incurring an additional delay. Both types of codes are covered in this chapter. 

3.1.1 Parity Codes 

Perhaps the simplest codes of all are the parity codes. In its most basic form, a 
parity-coded word includes d data bits and an extra (check) bit that holds the 
parity. In an even (odd) parity code, this extra bit is set so that the total number of 1s 
in the whole (d + 1)-bit word (including the parity bit) is even (odd). The overhead 
fraction of the parity code is 1/d. 

A parity code has a Hamming distance of 2 and is guaranteed to detect all 
single-bit errors. If a bit flips from 0 to 1 (or vice versa), the overall parity will no 
longer be the same, and the error can be detected. However, simple parity cannot 
correct any bit errors. 

Since the parity code is a separable code, it is easy to design parity encoding 
and decoding circuits for it. Figure 3.2 shows circuits to encode and decode 5-bit 
data words. The encoder consists of a five-input modulo-2 adder, which generates 
a 0 if the number of 1s is even. The output of this adder is the parity signal for 
the even parity code. The decoder generates the parity from the received data bits 
and compares this generated parity with the received parity bit. If they match, the 
output of the rightmost Exclusive-OR (XOR) gate is a 0, indicating that no error 
has been detected. If they do not match, the output is a 1, indicating an error. Note 
that double-bit errors cannot be detected by a parity check. However, all triple-bit 
(and, in general, any odd number of bit) errors will be detected. 

FIGURE 3.2 Even parity encoding and decoding circuits (the encoder produces the 
Parity Bit; the decoder produces the Error Signal). 
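
The behavior of the encoder and decoder described above can be imitated in software. This is a hedged Python sketch rather than a model of the actual gates in Figure 3.2; the names `parity_bit`, `encode`, and `error_signal` are illustrative assumptions, and words are represented as lists of 0/1 integers with the parity bit last.

```python
def parity_bit(data_bits, even=True):
    """Check bit that makes the total number of 1s even (or odd)."""
    p = sum(data_bits) % 2          # modulo-2 sum of the data bits
    return p if even else p ^ 1

def encode(data_bits, even=True):
    """Append the parity bit to the data bits (the code is separable)."""
    return list(data_bits) + [parity_bit(data_bits, even)]

def error_signal(word, even=True):
    """1 if the regenerated parity disagrees with the received parity bit, else 0."""
    *data, received_parity = word
    return parity_bit(data, even) ^ received_parity

word = encode([1, 0, 1, 1, 0])      # three 1s, so the even-parity bit is 1
print(word, error_signal(word))     # [1, 0, 1, 1, 0, 1] 0

word[2] ^= 1                        # one bit flips: detected
print(error_signal(word))           # 1

word[4] ^= 1                        # a second bit flips: the two errors mask each other
print(error_signal(word))           # 0
```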

The choice of even parity or odd parity depends on which type of all-bits 
unidirectional error (i.e., all-0s or all-1s error) is more probable. If, for example, we 
select the even parity code, the parity bit generated for the all-zeroes data word 
will be 0. In such a case, an all-0s failure will go undetected because it is a valid 
codeword. Selecting the odd parity code will allow the detection of the all-0s 
failure. If, on the other hand, the all-1s failure is more likely than is the all-0s failure, 
we have to make sure that the all-1s word (data and parity bit) is invalid. To this 
end, we should select the odd parity code if the total number of bits (including the 
parity bit) is even and vice versa. 

Several variations of the basic parity code have been proposed and 
implemented. One of these is the parity-bit-per-byte technique. Instead of having a 
single parity bit for the entire data word, we assign a separate parity bit to every 
byte (or any other group of bits). This will increase the overhead from 1/d to m/d, 
where m is the number of bytes (or other equal-sized groups). On the other hand, 
up to m errors will be detected as long as they occur in different bytes. If the 
all-0s and all-1s failures are likely to happen, we can select the odd parity code for 
one byte and the even parity code for another byte. A variation of the above is the 
byte-interlaced parity code. For example, suppose that d = 64 and denote the data 
bits by $a_{63}, a_{62}, \ldots, a_0$. Use eight parity bits such that the first will be the parity bit 
of $a_{63}, a_{55}, a_{47}, a_{39}, a_{31}, a_{23}, a_{15}$, and $a_7$, i.e., all the most significant bits in the eight 
bytes. Similarly, the remaining seven parity bits will be assigned so that the 
corresponding groups of bits are interlaced. Such a scheme is beneficial when shorting 
of adjacent bits is a common failure mode (e.g., in a bus). If, in addition, the 
parity type (odd or even) is alternated between the groups, the unidirectional errors 
(all-0s and all-1s) will also be detected. 

    0 0 0 1 1 1 1 
    1 0 1 0 1 1 0 
    1 1 0 0 0 0 0 
    0 0 0 1 1 1 1 
    1 1 1 1 1 1 0 
    1 0 0 1 0 0 0 

FIGURE 3.3 Example of overlapping parity (the last column and the bottom row 
hold the parity bits). 

An extension of the parity concept can render the code error correcting as well. 
The simplest such scheme involves organizing the data in a two-dimensional array 
as shown in Figure 3.3. The parity bits are shown in boldface. The bit at the end of 
a row represents the parity over this row; a bit at the bottom row is the parity bit 
for the corresponding column. The even parity scheme is followed for both rows 
and columns in Figure 3.3. A single-bit error anywhere will result in a row and a 
column being identified as erroneous. Because every row and column intersect in 
a unique bit position, the erroneous bit can be identified and corrected. 
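
The row/column intersection argument can be sketched in a few lines of Python. This is an illustrative sketch only; the function name `locate_single_error` is an assumption, and the array is the one shown in Figure 3.3, with the last column and bottom row holding even parity bits.

```python
def locate_single_error(block):
    """Return (row, col) of a single erroneous bit in an even-parity block, or None.

    `block` holds the data bits plus a parity column (last entry of each row)
    and a parity row (last row), all using even parity.
    """
    bad_rows = [i for i, row in enumerate(block) if sum(row) % 2]
    bad_cols = [j for j, col in enumerate(zip(*block)) if sum(col) % 2]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("more than one bit appears to be in error")

block = [
    [0, 0, 0, 1, 1, 1, 1],
    [1, 0, 1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 0],   # column parities
]

print(locate_single_error(block))   # None: every row and column has even parity

block[2][4] ^= 1                    # inject a single-bit error
row, col = locate_single_error(block)
print(row, col)                     # 2 4
block[row][col] ^= 1                # correct it by flipping the located bit
```

The sketch deliberately raises an exception when the failing rows and columns do not single out one position, since the scheme as described only guarantees correction of single-bit errors.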

The above was an example of overlapping parity, in which each bit is "covered" 
by more than one parity bit. We next describe the general theory associated with 
overlapping parity. Our aim is to be able to identify every single erroneous bit. 
Suppose there are d data bits in all. How many parity bits should be used and 
which bits should be covered by each parity bit? 

Let r be the number of parity bits (check bits) that we add to the d data bits, 
resulting in codewords of size d + r bits. Hence, there are d + r error states, where 
in state i the ith bit of the codeword is erroneous (keep in mind that we are dealing 
only with single-bit errors: this scheme will not detect all double-bit errors). In 
addition, there is the state in which no bit is erroneous, resulting in d + r + 1 states 
to be distinguished. 

We detect faults by performing r parity checks, that is, for each parity bit, we 
check whether the overall parity of this parity bit and the data bits covered by it 
is correct. These r parity checks can generate up to $2^r$ different check outcomes. 
Hence, the minimum number of parity bits is the smallest r that satisfies the 
following inequality: 

$$2^r \ge d + r + 1 \qquad (3.1)$$

How do we decide which data bits will be covered by each parity bit? We 
associate each of the d + r + 1 states with one of the $2^r$ possible outcomes of the r parity 
checks. This is best illustrated by an example. 

TABLE 3-1 Example of assignment of parity values to states 

    State                  Erroneous parity check(s)   Syndrome 
    No errors              None                        000 
    Bit 0 ($p_0$) error    $p_0$                       001 
    Bit 1 ($p_1$) error    $p_1$                       010 
    Bit 2 ($p_2$) error    $p_2$                       100 
    Bit 3 ($a_0$) error    $p_0, p_1$                  011 
    Bit 4 ($a_1$) error    $p_0, p_2$                  101 
    Bit 5 ($a_2$) error    $p_1, p_2$                  110 
    Bit 6 ($a_3$) error    $p_0, p_1, p_2$             111 

FIGURE 3.4 The assignment of parity bits in Table 3-1. 


■ EXAMPLE 

Suppose we have d = 4 data bits, $a_3 a_2 a_1 a_0$. From Equation 3.1 we know that 
r = 3 is the minimum number of parity bits, which we denote by $p_2 p_1 p_0$. There 
are 4 + 3 + 1 = 8 states that the codeword can be in. The complete 7-bit 
codeword is $a_3 a_2 a_1 a_0 p_2 p_1 p_0$, i.e., the least significant bit positions 0, 1, and 2 are 
parity bits and the others are data bits. Table 3-1 shows one possible assignment 
of parity check outcomes to the states, which is also illustrated in Figure 3.4. 
The assignment of no errors in the parity checks to the "no errors" state is 
obvious, as is the assignment for the next three states for which only one parity 
check is erroneous. The assignment of the bottom four states (corresponding 
to an error in a data bit) to the remaining four outcomes of the parity checks 
can be done in 4! ways. One of these is shown in Table 3-1 and Figure 3.4. 
For example, if the two checks of $p_0$ and $p_2$ (and only these) are in error, that 
indicates a problem with bit position 4, which is $a_1$. 

A parity bit will cover all bit positions whose error is indicated by the 
corresponding parity check. Thus, $p_0$ covers positions 0, 3, 4, and 6 (see Figure 3.4), 
i.e., $p_0 = a_0 \oplus a_1 \oplus a_3$. Similarly, $p_1 = a_0 \oplus a_2 \oplus a_3$ and $p_2 = a_1 \oplus a_2 \oplus a_3$. For 
example, for the data bits $a_3 a_2 a_1 a_0 = 1100$, the generated parity bits are $p_2 p_1 p_0 = 001$. 
Suppose now that the complete codeword 1100001 experiences a single-bit 
error and becomes 1000001. We recalculate the three parity bits, obtaining 
$p_2 p_1 p_0 = 111$. Calculating the difference between the newly generated values of 
the parity bits and their previous values (by performing a bitwise XOR 
operation) yields 110. This difference, which is called the syndrome, indicates which 
parity checks are in error. The syndrome 110 indicates, based on Table 3-1, that 
bit $a_2$ is in error and the correct data should be $a_3 a_2 a_1 a_0 = 1100$. This code is 
called a (7,4) Hamming single error correcting (SEC) code. 

The syndrome (which is the result of the parity checks) can be calculated 
directly from the bits $a_3 a_2 a_1 a_0 p_2 p_1 p_0$ in one step. This is best represented by the 
following matrix operation in which all the additions are modulo-2. The 
matrix below is called the parity check matrix: 

$$\begin{bmatrix} s_2 \\ s_1 \\ s_0 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} a_3 \\ a_2 \\ a_1 \\ a_0 \\ p_2 \\ p_1 \\ p_0 \end{bmatrix}$$

For all the syndromes generated this way (see Table 3-1), except for 011 and 
100, we can subtract 1 from the calculated syndrome to obtain the index of 
the bit in error. We can modify the assignment of states to the parity check 
outcomes so that the calculated syndrome will for all cases (except, clearly, the 
no-error case) provide the index of the bit in error after subtracting 1. For the 
example above, the order $a_3 a_2 a_1 p_2 a_0 p_1 p_0$ will provide the desired syndromes. 
If we modify the bit position indices so that they start with 1 and thus avoid 
the need to subtract a 1, we obtain the following parity check matrix: 

$$\begin{array}{ccccccc} 1 & 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ \hline 7 & 6 & 5 & 4 & 3 & 2 & 1 \\ a_3 & a_2 & a_1 & p_2 & a_0 & p_1 & p_0 \end{array}$$

Note that now the bit position indices of all the parity bits are powers of 2 (i.e., 
1, 2, and 4), and the binary representations of these indices form the parity 
check matrix. ■ 
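
The example can be replayed in code. Below is a minimal Python sketch of the (7,4) code using the assignment of Table 3-1; the names `SYNDROME_TO_BIT`, `parity_bits`, `encode`, `syndrome`, and `correct` are illustrative assumptions, and a codeword is represented as the list [a3, a2, a1, a0, p2, p1, p0].

```python
# Syndrome (p2 p1 p0 checks) -> index of the erroneous bit, as in Table 3-1.
SYNDROME_TO_BIT = {
    0b001: 0, 0b010: 1, 0b100: 2,             # parity bits p0, p1, p2
    0b011: 3, 0b101: 4, 0b110: 5, 0b111: 6,   # data bits a0, a1, a2, a3
}

def parity_bits(a3, a2, a1, a0):
    """Parity equations from the example: returns (p2, p1, p0)."""
    p0 = a0 ^ a1 ^ a3
    p1 = a0 ^ a2 ^ a3
    p2 = a1 ^ a2 ^ a3
    return p2, p1, p0

def encode(a3, a2, a1, a0):
    return [a3, a2, a1, a0, *parity_bits(a3, a2, a1, a0)]

def syndrome(word):
    """XOR of the regenerated parity bits with the received ones, as a 3-bit integer."""
    a3, a2, a1, a0, p2, p1, p0 = word
    q2, q1, q0 = parity_bits(a3, a2, a1, a0)
    return ((q2 ^ p2) << 2) | ((q1 ^ p1) << 1) | (q0 ^ p0)

def correct(word):
    """Return (corrected word, syndrome), assuming at most one bit is in error."""
    s = syndrome(word)
    fixed = list(word)
    if s:
        position = SYNDROME_TO_BIT[s]   # bit index counted from the right
        fixed[6 - position] ^= 1
    return fixed, s

print(encode(1, 1, 0, 0))                          # [1, 1, 0, 0, 0, 0, 1]
fixed, s = correct([1, 0, 0, 0, 0, 0, 1])          # a2 was flipped
print(format(s, "03b"), fixed)                     # 110 [1, 1, 0, 0, 0, 0, 1]
```

Note that `correct` always trusts the syndrome, so a double error is silently turned into a wrong "correction", which is the limitation the SEC/DED extension addresses.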


If $2^r > d + r + 1$, we need to select d + r + 1 out of the $2^r$ binary combinations 
to serve as syndromes. In such a case, it is best to avoid those combinations that 
include a large number of 1s. This will result in a parity check matrix that includes 
fewer 1s, leading to simpler circuits for the encoding and decoding operations. 
For example, for d = 3 we set r = 3, but only seven out of the eight 3-bit binary 
combinations are needed. Figure 3.5 shows two possible parity check matrices: (a) 
uses the combination 111 whereas (b) does not. As a result, the encoding circuit for 
the matrix in (a) will require a single XOR gate for generating $p_1$ and $p_2$ but two 
XOR gates for generating $p_0$. In contrast, the encoding circuit for the matrix in (b) 
needs a single XOR gate for generating each parity bit. 

             (a)                          (b) 
    a2 a1 a0 p2 p1 p0            a2 a1 a0 p2 p1 p0 
    1  0  1  1  0  0             0  1  1  1  0  0 
    1  1  0  0  1  0             1  0  1  0  1  0 
    1  1  1  0  0  1             1  1  0  0  0  1 

FIGURE 3.5 Two possible parity check matrices for d = 3. 

The code in Table 3-1 is capable of correcting a single-bit error but cannot detect 
a double error. For example, if two errors occur in 1100001, yielding 1010001 ($a_2$ 
and $a_1$ are erroneous), the resulting syndrome is 011, indicating erroneously that 
bit $a_0$ should be corrected. One way to improve the error detection capabilities is 
to add an extra check bit that will serve as the parity bit of all the other data and 
check bits. The resulting code is called an (8,4) single-error correcting/double-error 
detecting (SEC/DED) Hamming code. The generation of the syndrome for 
this code is shown below. 

$$\begin{bmatrix} s_3 \\ s_2 \\ s_1 \\ s_0 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} a_3 \\ a_2 \\ a_1 \\ a_0 \\ p_3 \\ p_2 \\ p_1 \\ p_0 \end{bmatrix}$$

As before, the last three bits of the syndrome indicate the bit in error to be 
corrected, as long as the first bit, $s_3$, is equal to 1. Since $p_3$ is the parity bit of all the 
other data and check bits, a single-bit error changes the overall parity, and as a 
result, $s_3$ must be equal to 1. If $s_3$ is zero and any of the other syndrome bits is 
nonzero, a double or greater error is detected. For example, if one error occurs 
in 11001001 yielding 10001001, the calculated syndrome is 1110, indicating, as 
before, that $a_2$ is in error. If, however, two errors occur, resulting in 10101001, the 
calculated syndrome is 0011, indicating that an uncorrectable error has occurred. 
In general, an even number of errors is detectable whereas an odd (and larger 
than 1) number of errors is indistinguishable from a single-bit error, leading to an 
erroneous correction. 
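
The two numerical examples above can be verified with a short Python sketch. It assumes the (8,4) parity check matrix whose rows produce $s_3, s_2, s_1, s_0$ from the bits $a_3 a_2 a_1 a_0 p_3 p_2 p_1 p_0$; the names `H`, `syndrome`, and `classify` are illustrative and not taken from the text.

```python
# Rows of the (8,4) SEC/DED parity check matrix; columns are a3 a2 a1 a0 p3 p2 p1 p0.
H = [
    [1, 1, 1, 1, 1, 1, 1, 1],   # s3: overall parity
    [1, 1, 1, 0, 0, 1, 0, 0],   # s2
    [1, 1, 0, 1, 0, 0, 1, 0],   # s1
    [1, 0, 1, 1, 0, 0, 0, 1],   # s0
]

def syndrome(word):
    """word = [a3, a2, a1, a0, p3, p2, p1, p0]; returns (s3, s2, s1, s0), modulo-2 sums."""
    return tuple(sum(h * b for h, b in zip(row, word)) % 2 for row in H)

def classify(word):
    """Interpret the syndrome following the rule described in the text."""
    s3, *rest = syndrome(word)
    if s3 == 0 and not any(rest):
        return "no error detected"
    if s3 == 1:
        return "single error (correctable)"
    return "double error (uncorrectable)"

bits = lambda s: [int(c) for c in s]

print(syndrome(bits("11001001")), classify(bits("11001001")))
# (0, 0, 0, 0) no error detected
print(syndrome(bits("10001001")), classify(bits("10001001")))
# (1, 1, 1, 0) single error (correctable)
print(syndrome(bits("10101001")), classify(bits("10101001")))
# (0, 0, 1, 1) double error (uncorrectable)
```

As the text notes, three or more errors with odd weight would also be classified as a single error by this rule, so the classification is only reliable when at most two bits are in error.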

Current memory circuits that have SEC/DED support (not all do) use either a 
(39,32) or a (72,64) Hamming code, that is, 7 check bits for 32 data bits or 8 check 
bits for 64 data bits. Since errors in two or more physically adjacent 
memory cells are quite likely, the bits in a single memory word are often assigned 
to non-adjacent memory cells to reduce the probability of an uncorrectable double 
error in the same word. 

A disadvantage of the above SEC/DED Hamming code is that the calculation of 
the additional check bit, which is the parity bit of all other check and data bits, may 
slow down encoding and decoding. One way to avoid this penalty but still have 
the ability to detect double errors is to assign to the data and check bits only 
syndromes that include an odd number of 1s. Note that in the original SEC Hamming 
code, each parity bit has a syndrome that includes a single 1. By restricting 
ourselves to the use of syndromes that include an odd number of 1s (for any single-bit 
error), a double error will result in a syndrome with an even number of 1s, 
indicating an error that cannot be corrected. A possible parity check matrix for such 
an (8,4) SEC/DED Hamming code is shown below. 

$$\begin{array}{cccccccc} a_3 & a_2 & a_1 & a_0 & p_3 & p_2 & p_1 & p_0 \\ \hline 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 & 0 & 1 \end{array}$$

Limiting ourselves to odd syndromes implies that we use only $2^{r-1}$ out of the $2^r$ 
possible combinations. This is equivalent to saying that we need an extra check bit 
beyond the minimum required for a SEC Hamming code, and the total number of 
check bits is the same as for the original SEC/DED Hamming code. 

If the number of data bits is very large, the probability of having an error that 
is not correctable by an SEC code increases. To reduce this probability, we may 
partition the D data bits into, say, D/d equal slices (of d bits each) and encode each 
slice separately using an appropriate (d + r, d) SEC Hamming code. This, however, 
will increase the overhead, r/d, imposed by the SEC code. We have therefore a 
tradeoff between the probability of an uncorrectable error and the overhead. If f is 
the probability of a bit error and if bit errors occur independently of one another, 
the probability of more than one bit error in a field of d + r bits is given by 

$$\Phi(d,r) = 1 - (1-f)^{d+r} - (d+r)\,f\,(1-f)^{d+r-1} \approx \frac{(d+r)(d+r-1)}{2}\,f^2 \quad \text{if } f \ll 1 \qquad (3.2)$$

TABLE 3-2 The overhead versus probability of an uncorrectable error tradeoff for 
an overlapping parity code with a total of D = 1024 data bits and a bit error 
probability of $f = 10^{-10}$ 

       d     r    Overhead r/d    $\Psi(D,d,r)$ 
       2     3    1.5000          0.5120E-16 
       4     3    0.7500          0.5376E-16 
       8     4    0.5000          0.8448E-16 
      16     5    0.3125          0.1344E-15 
      32     6    0.1875          0.2250E-15 
      64     7    0.1094          0.3976E-15 
     128     8    0.0625          0.7344E-15 
     256     9    0.0352          0.1399E-14 
     512    10    0.0195          0.2720E-14 
    1024    11    0.0107          0.5351E-14 


The probability that there is an uncorrectable error in any one of the D/d slices is 
given by 

$$\Psi(D,d,r) = 1 - (1 - \Phi(d,r))^{D/d} \approx (D/d)\,\Phi(d,r) \quad \text{if } \Phi(d,r) \ll 1 \qquad (3.3)$$

Some numerical results illustrating the tradeoff are provided in Table 3-2. 
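
The tradeoff can be explored numerically. The sketch below is illustrative Python, with function names (`min_check_bits`, `phi`, `psi`) chosen here rather than taken from the text. It uses the approximations in Equations 3.2 and 3.3 instead of the exact expressions, because for very small f the exact form subtracts nearly equal floating-point numbers and loses all precision. The value f = 1e-10 is an assumption of this sketch; it is the bit error probability for which the approximations give the entries listed in Table 3-2.

```python
def min_check_bits(d: int) -> int:
    """Smallest r with 2**r >= d + r + 1 (Equation 3.1)."""
    r = 1
    while 2 ** r < d + r + 1:
        r += 1
    return r

def phi(d: int, r: int, f: float) -> float:
    """Approximate probability of more than one bit error among d + r bits (f << 1)."""
    n = d + r
    return n * (n - 1) / 2 * f ** 2

def psi(D: int, d: int, r: int, f: float) -> float:
    """Approximate probability of an uncorrectable error in any of the D/d slices."""
    return (D / d) * phi(d, r, f)

D, f = 1024, 1e-10   # f is an assumed value, see the note above
print("    d   r  overhead  Psi")
for d in (2, 4, 8, 16, 32, 64, 128, 256, 512, 1024):
    r = min_check_bits(d)
    print(f"{d:5d} {r:3d}  {r / d:8.4f}  {psi(D, d, r, f):.4e}")
```

Smaller slices lower the probability of an uncorrectable error but raise the overhead r/d, which is the tradeoff the table illustrates.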

3.1.2 Checksum 

Checksum is primarily used to detect errors in data transmission through 
communication channels. The basic idea is to add up the block of data that is being 
transmitted and to transmit this sum as well. The receiver then adds up the data it 
received and compares this sum with the checksum it received. If the two do not 
match, an error is indicated. 

There are several variations of checksums. Assume the data words are d bits 
long. In the single-precision version, the checksum is a modulo-$2^d$ addition. In the 
double-precision version, it is a modulo-$2^{2d}$ addition. Figure 3.6 shows an example 
of each. In general, the single-precision checksum catches fewer errors than the 
double-precision version, since we only keep the rightmost d bits of the sum. The 
residue checksum takes into account the carry out of the dth bit as an end-around 
carry (i.e., the carryout is added to the least significant bit of the checksum) and 
is therefore somewhat more reliable. The Honeywell checksum, by concatenating 
words together into pairs for the checksum calculation (performed modulo-$2^{2d}$), 
guards against errors happening in the same position. For example, consider the 
situation in Figure 3.7. Because the line carrying $a_3$ is stuck at 0, the receiver will 
find that the transmitted checksum and its own computed checksum match in the 
single-precision checksum. However, the Honeywell checksum, when computed 
on the received data, will differ from the received checksum and the error will be 
detected. All the checksum schemes allow error detection but not error location, 
and the entire block of data must be retransmitted if an error is detected. 

                 (a) Single-      (b) Double-      (c) Residue      (d) Honeywell 
                     precision        precision 
    Data words   0000             0000             0000             00000101 
                 0101             0101             0101             11110010 
                 1111             1111             1111 
                 0010             0010             0010 
    Checksum     0110             00010110         0111             11110111 

FIGURE 3.6 Variations of checksum coding (the Checksum row holds the computed 
checksums). 

                 (b) Single-precision            (c) Honeywell 
                 Transmitted    Received         Transmitted    Received 
    Data words   1000           0000             10001011       00000011 
                 1011           0011             00001100       00000100 
                 0000           0000 
                 1100           0100 
    Checksum     1111           0111             10010111       00010111 

FIGURE 3.7 Honeywell versus single-precision checksum; panel (a) shows the circuit 
(the Checksum row holds the transmitted/received checksum). 
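
The four checksum variants can be sketched as follows. This is illustrative Python, with function names chosen here; words are assumed to be non-negative integers of d bits, and the Honeywell variant assumes an even number of words. The data values are those of Figures 3.6 and 3.7.

```python
def single_precision(words, d):
    """Sum of the words, keeping only the rightmost d bits."""
    return sum(words) % (1 << d)

def double_precision(words, d):
    """Sum of the words, kept to 2d bits."""
    return sum(words) % (1 << (2 * d))

def residue(words, d):
    """d-bit sum with end-around carry (carries out of bit d are added back in)."""
    mask = (1 << d) - 1
    total = 0
    for w in words:
        total += w
        total = (total & mask) + (total >> d)
    return total

def honeywell(words, d):
    """Concatenate consecutive word pairs, then add the 2d-bit pairs modulo 2**(2d)."""
    pairs = [(words[i] << d) | words[i + 1] for i in range(0, len(words), 2)]
    return sum(pairs) % (1 << (2 * d))

words = [0b0000, 0b0101, 0b1111, 0b0010]            # data of Figure 3.6
print(format(single_precision(words, 4), "04b"))    # 0110
print(format(double_precision(words, 4), "08b"))    # 00010110
print(format(residue(words, 4), "04b"))             # 0111
print(format(honeywell(words, 4), "08b"))           # 11110111

# Figure 3.7: the most significant of the four lines is stuck at 0.
sent = [0b1000, 0b1011, 0b0000, 0b1100]
received = [w & 0b0111 for w in sent]
rx_single = single_precision(sent, 4) & 0b0111       # checksum travels over the same lines
rx_honeywell = honeywell(sent, 4) & 0b01110111       # both 4-bit halves lose their top bit
print(single_precision(received, 4) == rx_single)    # True: the fault goes undetected
print(honeywell(received, 4) == rx_honeywell)        # False: the fault is detected
```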

3.1.3 M-of-N Codes 

The M-of-N code is an example of a unidirectional error-detecting code. As the 
term implies, in unidirectional errors all the affected bits change in the same 
direction, either from 0 to 1 or from 1 to 0 but not in both directions. 

In an M-of-N code, every N-bit codeword has exactly M bits that are 1, 
resulting in $\binom{N}{M}$ codewords. Any single-bit error will change the number of 1s to either 
M + 1 or M - 1 and will be detected. Unidirectional multiple errors would also be 
detected. A simple instance of an M-of-N code is the 2-of-5 code, which consists 
of 10 codewords and can serve to encode the decimal digits. An example of a 
2-of-5 code is shown in Table 3-3. There are 10! different ways of assigning the 10 
codewords to the decimal digits. The assignment shown in the table preserves the 
binary order. The main advantage of M-of-N codes is their conceptual simplicity. 
However, encoding and decoding become relatively complex operations because 
such codes are, in general, nonseparable, unlike the parity and checksum codes. 

TABLE 3-3 The 2-of-5 code for decimal digits 

    Digit    Codeword 
    0        00011 
    1        00101 
    2        00110 
    3        01001 
    4        01010 
    5        01100 
    6        10001 
    7        10010 
    8        10100 
    9        11000 

Still, separable M-of-N codes can be constructed. For example, an M-of-2M code 
can be constructed by adding M check bits to the given M data bits so that the 
resulting 2M-bit codeword has exactly M 1s. Such codes are easy to encode and 
decode but have a greater overhead (100% or more) than do the nonseparable 
ones. For example, to encode the 10 decimal digits, we start with 4 bits per digit, 
leading to a 4-of-8 code, which has a much higher level of redundancy than does 
the 2-of-5 code. 
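
As a small illustration, the 2-of-5 codewords of Table 3-3 can be generated and checked in Python. The helper names `m_of_n_code` and `is_valid` are assumptions made for this sketch; codewords are represented as bit strings.

```python
from itertools import combinations

def m_of_n_code(m, n):
    """All n-bit words with exactly m ones, listed in increasing binary order."""
    values = [sum(1 << i for i in ones) for ones in combinations(range(n), m)]
    return [format(v, f"0{n}b") for v in sorted(values)]

def is_valid(word, m):
    """A received word is accepted only if it contains exactly m ones."""
    return word.count("1") == m

two_of_five = m_of_n_code(2, 5)
digit_to_code = dict(enumerate(two_of_five))   # binary order, as in Table 3-3

print(len(two_of_five))            # 10
print(digit_to_code[7])            # 10010
print(is_valid("10010", 2))        # True
print(is_valid("11110", 2))        # False: two bits changed from 0 to 1
```

The validity check is a simple count of 1s, but recovering the digit requires a table lookup, which reflects the nonseparable nature of the code.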

3.1.4 Berger Code 

The M-of-2M code for detecting unidirectional errors is a separable code but has a 
high level of information redundancy. A unidirectional error detecting code that is 
separable and has a much lower overhead is the Berger code. To encode, count the 
number of 1s in the word, express this count in binary representation, complement 
it, and append this quantity to the data. For example, suppose we are encoding 
11101. There are four 1s in it, which is 100 in binary. Complementing results in 011 
and the codeword will be 11101011. 

The overhead of the Berger code can be computed as follows. If there are d data 
bits, then there can be at most d 1s in the word, which can take up to 
$\lceil \log_2(d+1) \rceil$ bits to count. The overhead per data bit is therefore given by 

$$\frac{\lceil \log_2(d+1) \rceil}{d}$$

This overhead is tabulated for some values of d in Table 3-4. If $d = 2^k - 1$ for an 
integer k, then the number of check bits, denoted by r, is r = k and the resulting 
code is called a maximum-length Berger code. For the unidirectional error 
detection capability provided, the Berger code requires the smallest number of check 
bits out of all known separable codes. 

TABLE 3-4 Berger code overhead 

      d     r    Overhead 
      8     4    0.5000 
     15     4    0.2667 
     16     5    0.3125 
     31     5    0.1613 
     32     6    0.1875 
     63     6    0.0952 
     64     7    0.1094 
    127     7    0.0551 
    128     8    0.0625 
    255     8    0.0314 
    256     9    0.0352 
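
The encoding rule and the overhead formula can be sketched in Python as follows. The names `berger_check_bits`, `berger_encode`, and `berger_check` are illustrative; data words are bit strings, and the number of check bits is obtained with `int.bit_length`, which for a positive integer d equals $\lceil \log_2(d+1) \rceil$.

```python
def berger_check_bits(d: int) -> int:
    """Number of check bits for d data bits: ceil(log2(d + 1))."""
    return d.bit_length()

def berger_encode(data: str) -> str:
    """Append the bitwise complement of the count of 1s in `data`."""
    r = berger_check_bits(len(data))
    count = data.count("1")
    complemented = count ^ ((1 << r) - 1)   # complement within r bits
    return data + format(complemented, f"0{r}b")

def berger_check(word: str, d: int) -> bool:
    """Re-encode the first d bits and compare with the received word."""
    return berger_encode(word[:d]) == word

print(berger_encode("11101"))            # 11101011
print(berger_check("11101011", 5))       # True
print(berger_check("11111011", 5))       # False: a data bit changed from 0 to 1

for d in (8, 15, 16, 31):
    r = berger_check_bits(d)
    print(d, r, f"{r / d:.4f}")
```

Because the code is separable, decoding is just `word[:d]`; the check bits are only used to validate the word.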

3.1.5 Cyclic Codes 

In cyclic codes, encoding of data consists of multiplying (modulo-2) the data word 
by a constant number, and the coded word is the product that results. Decoding 
is done by dividing by the same constant: if the remainder is nonzero, it indicates 
that an error has occurred. These codes are called cyclic because for every 
codeword $a_{n-1}, a_{n-2}, \ldots, a_0$, its cyclic shift $a_0, a_{n-1}, a_{n-2}, \ldots, a_1$ is also a codeword. For 
example, the 5-bit code consisting of {00000, 00011, 00110, 01100, 11000, 10001, 00101, 
01010, 10100, 01001, 10010, 01111, 11110, 11101, 11011, 10111} is cyclic. 

Cyclic codes have been the focus of a great deal of research and are widely used 
in both data storage and communication. We will present only a small sampling 
of this work: the theory of cyclic codes rests on advanced algebra, which is 
outside the scope of this book. Interested readers are directed to the ample coding 
literature (see the Further Reading section). 

Suppose k is the number of bits of data that we are seeking to encode. The 
encoded word of length n bits is obtained by multiplying the given k data bits by 
a number that is n - k + 1 bits long. 

       1110 
     x   11 
     ------ 
       1110 
      1110 
     ------ 
      10010 

FIGURE 3.8 Encoding the data word 1110. 

In cyclic coding theory, the multiplier is represented as a polynomial, called the 
generator polynomial. The 1s and 0s in the (n - k + 1)-bit multiplier are treated as 
the coefficients of an (n - k)-degree polynomial. For example, if the 5-bit multiplier 
is 11001, the generator polynomial is 
$G(X) = 1 \cdot X^4 + 1 \cdot X^3 + 0 \cdot X^2 + 0 \cdot X^1 + 1 \cdot X^0 = X^4 + X^3 + 1$. 
A cyclic code using a generator polynomial of degree n - k and total 
number of encoded bits n is called an (n,k) cyclic code. An (n,k) cyclic code can 
detect all single errors and also all runs of adjacent bit errors, so long as these runs 
are shorter than n - k. These codes are therefore very useful in such applications as 
wireless communication, where the channels are frequently noisy and subject the 
transmission to bursts of interference that can result in runs of adjacent bit errors. 
For a polynomial of degree n - k to serve as a generator polynomial of an (n,k) 
cyclic code, it must be a factor of $X^n - 1$. The polynomial $X^4 + X^3 + 1$ is a factor 
of $X^{15} - 1$ and can thus serve as a generator polynomial for a (15,11) cyclic code. 
Another factor of $X^{15} - 1$ is $X^4 + X + 1$, which can generate another (15,11) cyclic 
code. The polynomial $X^{15} - 1$ has five prime factors, namely, 

$$X^{15} - 1 = (X + 1)(X^2 + X + 1)(X^4 + X + 1)(X^4 + X^3 + 1)(X^4 + X^3 + X^2 + X + 1)$$

Any one of these five factors and any product of two (or more) of these factors 
can serve as a generating polynomial for a cyclic code. For example, the product 
of the first two factors is $(X + 1)(X^2 + X + 1) = X^3 + 1$, and it can generate a (15,12) 
cyclic code. When multiplying $X + 1$ and $X^2 + X + 1$, note that all additions are 
performed modulo-2. Also note that subtraction in modulo-2 arithmetic is identical to 
addition, and thus, $X^{15} - 1$ is identical to $X^{15} + 1$. 

The 5-bit cyclic code mentioned at the beginning of this section has the 
generator polynomial $X + 1$ satisfying $X^5 - 1 = (X + 1)(X^4 + X^3 + X^2 + X + 1)$ and is 
a (5,4) cyclic code. We can verify that $X + 1$ is the generator polynomial for the 
above (5,4) cyclic code by multiplying all 4-bit data words (0000 through 1111) by 
$X + 1$, or 11 in binary. For example, the codeword corresponding to the data word 
0110 is 01010, as we now show. The data word 0110 can be represented as $X^2 + X$, 
and when multiplied by $X + 1$, results in $X^3 + X^2 + X^2 + X = X^3 + X$, which 
represents the 5-bit codeword 01010. The multiplication by the generator polynomial 
can also be performed directly in binary arithmetic rather than using polynomials. 
For example, the codeword corresponding to the data word 1110 is obtained by 
multiplying 1110 by 11 in modulo-2 arithmetic as shown in Figure 3.8. Note that 
this cyclic code is not separable. The data bits and check bits within the codeword 
10010 are not separable. 
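
Modulo-2 multiplication and division of polynomials map directly onto shifts and XORs of integers. The following Python sketch, with the assumed helper names `gf2_mul` and `gf2_mod`, represents a polynomial by the integer whose binary digits are its coefficients (so 11001 stands for $X^4 + X^3 + 1$) and replays the examples of this section.

```python
def gf2_mul(a: int, b: int) -> int:
    """Carry-less (modulo-2) product of two polynomials held as bit masks."""
    product = 0
    while b:
        if b & 1:
            product ^= a      # add (XOR) the shifted multiplicand
        a <<= 1
        b >>= 1
    return product

def gf2_mod(a: int, g: int) -> int:
    """Remainder of the modulo-2 division of a by g (g must be nonzero)."""
    glen = g.bit_length()
    while a.bit_length() >= glen:
        a ^= g << (a.bit_length() - glen)   # cancel the leading term
    return a

G = 0b11                                     # generator polynomial X + 1
print(format(gf2_mul(0b0110, G), "05b"))     # 01010
print(format(gf2_mul(0b1110, G), "05b"))     # 10010

# All 16 codewords of the (5,4) code, and a check that the set is closed under cyclic shifts.
code = {format(gf2_mul(w, G), "05b") for w in range(16)}
print(all(c[-1] + c[:-1] in code for c in code))    # True

# Decoding check: a codeword leaves a zero remainder when divided by the generator.
print(gf2_mod(gf2_mul(0b1110, G), G))               # 0

# X^4 + X^3 + 1 divides X^15 + 1, so it can generate a (15,11) cyclic code.
print(gf2_mod((1 << 15) | 1, 0b11001))              # 0
```

Extracting the data from a valid codeword of this nonseparable code would additionally require the quotient of the division, which is omitted from the sketch for brevity.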

FIGURE 3.9 Encoding circuit for the (15,11) cyclic code with the generating polynomial 
$X^4 + X^3 + 1$. 

              10001100101 
            x       11001 
          --------------- 
              10001100101 
             00000000000 
            00000000000 
           10001100101 
          10001100101 
          --------------- 
          110000100011101 
                    ^ 

FIGURE 3.10 Example of modulo-2 multiplication for encoding the 11-bit input 
10001100101 (the caret marks the boxed column, the fifth bit of the product). 

One of the most significant reasons for the popularity of cyclic codes is the fact
that multiplication and division by the generator polynomial can be implemented
in hardware using simple shift registers and XOR gates. Such a simple implementation
allows fast encoding and decoding. Let us start with an example: consider
the generator polynomial $X^4 + X^3 + 1$ (corresponding, as we have seen, to the
multiplier 11001). Consider the circuit shown in Figure 3.9, where the square boxes are
delay elements, which hold their input for one clock cycle.

The reader will find that this circuit does indeed multiply (modulo-2) serial
inputs by 11001. To see why this should be, consider the multiplication shown in
Figure 3.10. Focus on the boxed column. It shows how the fifth bit of the product
is the modulo-2 sum of the corresponding bits of the multiplicand shifted 0 times,
3 times, and 4 times. If the multiplicand is fed in serially, starting with the least
significant bit, and we add the multiplicand shifted as shown above, we arrive at
the product. It is precisely this shifting that is done by the delay elements of the
circuit. Table 3-5 illustrates the operation of the encoding circuit, in which $i_3$ is the
input to the $O_3$ delay element.
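
The clock-by-clock behavior of the encoder can be mimicked in software. The sketch below assumes a wiring inferred from Table 3-5 rather than read off Figure 3.9: $O_4$ delays the input by one clock, $i_3$ is the input XOR $O_4$, the chain $O_3, O_2, O_1$ delays $i_3$ by three more clocks, and the output is the input XOR $O_1$. The function name `encode_serial` is ours.

```python
def encode_serial(data_bits_lsb_first, total_clocks=15):
    """Simulate serial multiplication by X^4 + X^3 + 1; returns output bits, LSB first."""
    o4 = o3 = o2 = o1 = 0  # delay elements start cleared
    padding = [0] * (total_clocks - len(data_bits_lsb_first))
    encoded = []
    for bit in list(data_bits_lsb_first) + padding:
        i3 = bit ^ o4             # input to the O3 delay element
        encoded.append(bit ^ o1)  # encoded output during this clock
        o4, o3, o2, o1 = bit, i3, o3, o2  # all delay elements latch on the clock edge
    return encoded

data = "10001100101"
out = encode_serial([int(c) for c in reversed(data)])
print("".join(str(b) for b in reversed(out)))
```

Reading the output bits most significant first gives the 15-bit product of Figure 3.10.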

We now consider the process of decoding, which is done through division by
the generator polynomial. Let us first illustrate the decoding process through
division by the constant 11001 as shown in Figure 3.11a. The final remainder is zero,
indicating that no error has been detected. If a single error occurs and we receive
110000100111101 (the tenth bit from the left is the bit in error), the division will generate a
TABLE 3-5 The operation of the encoder in Figure 3.9 for the example in Figure 3.10

    Shift clock   Input data   O4   i3   O3 O2 O1   Encoded output
         1            1         0    1     000            1
         2            0         1    1     100            0
         3            1         0    1     110            1
         4            0         1    1     111            1
         5            0         0    0     111            1
         6            1         0    1     011            0
         7            1         1    0     101            0
         8            0         1    1     010            0
         9            0         0    0     101            1
        10            0         0    0     010            0
        11            1         0    1     001            0
        12            0         1    1     100            0
        13            0         0    0     110            0
        14            0         0    0     011            1
        15            0         0    0     001            1



    110000100011101 : 11001 = 10001100101
    11001
        10100
        11001
         11010
         11001
            11111
            11001
              11001
              11001
              00000

    (a) Error free


    110000100111101 : 11001 = 10001100110
    11001
        10100
        11001
         11011
         11001
            10111
            11001
             11100
             11001
              01011

    (b) A single-bit error (the tenth bit from the left)


FIGURE 3.11 Decoding through division.


nonzero remainder as shown in Figure 3.11b. To show that every single error can
be detected, note that a single error in bit position i can be represented by $X^i$, and
the received codeword that includes such an error can be written as $D(X)G(X) + X^i$,
where D(X) is the original data word and G(X) is the generator polynomial. If
G(X) has at least two terms, it does not divide $X^i$, and consequently, dividing
$D(X)G(X) + X^i$ by G(X) will generate a nonzero remainder.

The above (15,11) cyclic code can be shown to have a Hamming distance of
3, thus allowing the detection of all double-bit errors irrespective of their bit
positions. The situation is different when three-bit errors occur. Suppose first that
    110000111010101 : 11001 = 10001101101
    11001
        10111
        11001
         11100
         11001
           10110
           11001
            11111
            11001
              11001
              11001
              00000

    (a) Three nonadjacent errors


    110000011011101 : 11001 = 10001110011
    11001
        10011
        11001
         10100
         11001
          11011
          11001
             10110
             11001
              11111
              11001
              00110

    (b) Three adjacent errors


FIGURE 3.12 Decoding through division with 3-bit errors.


the 3-bit errors occur in nonadjacent bit positions, producing, for example, 11000
01110 10101 instead of 11000 01000 11101. Repeating the above division for this
codeword results in the quotient and remainder shown in Figure 3.12a. The final
remainder is zero, and consequently, the 3-bit errors were not detected, although
the final result is erroneous. If, however, the 3-bit errors are adjacent, e.g., 11000
00110 11101, we obtain the quotient and remainder shown in Figure 3.12b. The
nonzero remainder indicates an error.
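
The four divisions of Figures 3.11 and 3.12 can be reproduced with a short modulo-2 long-division routine. This is a sketch under the assumption that words are stored as integers with the leftmost bit most significant; `gf2_divmod` is a name of our own.

```python
def gf2_divmod(dividend: int, divisor: int):
    """Modulo-2 long division; returns (quotient, remainder)."""
    quotient = 0
    width = divisor.bit_length()
    while dividend.bit_length() >= width:
        shift = dividend.bit_length() - width
        quotient |= 1 << shift
        dividend ^= divisor << shift  # modulo-2 subtraction is XOR
    return quotient, dividend

G = 0b11001  # X^4 + X^3 + 1
received_words = {
    "error free":              "110000100011101",
    "single-bit error":        "110000100111101",
    "three nonadjacent errors": "110000111010101",
    "three adjacent errors":    "110000011011101",
}
for label, word in received_words.items():
    q, r = gf2_divmod(int(word, 2), G)
    status = "no error detected" if r == 0 else "error detected"
    print(f"{label}: quotient {q:011b}, remainder {r:05b} ({status})")
```

A zero remainder only says that the received word is some valid codeword, which is why the nonadjacent 3-bit error pattern slips through.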

To implement a divider circuit, we should realize that division can be achieved
through multiplication in the feedback loop. We illustrate this through the following
example.


■ EXAMPLE

Let the encoded word be denoted by the polynomial E(X), and use the previously
defined notation of G(X) and D(X) for the generator polynomial and the
original data word, respectively. If no bit errors exist, we will receive E(X) and
can calculate D(X) from $D(X) = E(X)/G(X)$, and the remainder will be zero. In such a
case, we can rewrite the division as

$$E(X) = D(X) \cdot G(X) = D(X)\{X^4 + X^3 + 1\} = D(X)\{X^4 + X^3\} + D(X)$$

thus

$$D(X) = E(X) - D(X)\{X^4 + X^3\} = E(X) + D(X)\{X^4 + X^3\}$$

(because addition = subtraction in modulo-2).
FIGURE 3.13 Decoding circuit for the (15,11) cyclic code with the generating polynomial
$X^4 + X^3 + 1$.


TABLE 3-6 The operation of the decoder in Figure 3.13 for the input
110000100011101

    Shift clock   Encoded input   i4   O4   i3   O3 O2 O1   Decoded output
         1              1          1    0    1     000            1
         2              0          0    1    1     100            0
         3              1          1    0    1     110            1
         4              1          0    1    1     111            0
         5              1          0    0    0     111            0
         6              0          1    0    1     011            1
         7              0          1    1    0     101            1
         8              0          0    1    1     010            0
         9              1          0    0    0     101            0
        10              0          0    0    0     010            0
        11              0          1    0    1     001            1
        12              0          0    1    1     100            0
        13              0          0    0    0     110            0
        14              1          0    0    0     011            0
        15              1          0    0    0     001            0



With this last expression, we can construct the feedback circuit for division
(see Figure 3.13). We start with all delay elements holding 0, produce first the
eleven quotient bits that constitute the data bits, and then the four remainder
bits. If these remainder bits are nonzero, we know that an error has occurred.
Table 3-6 illustrates the decode operation, in which $i_3$ is the input to the $O_3$
delay element. The reader can verify that any single- or double-bit error in the
received sequence E(X) will result in a nonzero remainder. ■
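
The feedback divider can be simulated in the same style as the encoder. The sketch below assumes a wiring inferred from Table 3-6 rather than from Figure 3.13: the decoded bit $i_4$ is the received bit XOR $O_1$, it is fed back through $O_4$, and $i_3 = i_4 \oplus O_4$ enters the chain $O_3, O_2, O_1$. The name `decode_serial` is ours.

```python
def decode_serial(received_lsb_first):
    """Simulate division by X^4 + X^3 + 1 with feedback; returns output bits, LSB first."""
    o4 = o3 = o2 = o1 = 0  # delay elements start cleared
    decoded = []
    for bit in received_lsb_first:
        i4 = bit ^ o1   # D = E + D * (X^4 + X^3): received bit plus feedback
        i3 = i4 ^ o4    # input to the O3 delay element
        decoded.append(i4)
        o4, o3, o2, o1 = i4, i3, o3, o2
    return decoded

def split_output(word: str):
    """Return (data bits, remainder bits) as strings, most significant bit first."""
    out = decode_serial([int(c) for c in reversed(word)])
    data = "".join(str(b) for b in reversed(out[:11]))
    remainder = "".join(str(b) for b in reversed(out[11:]))
    return data, remainder

print(split_output("110000100011101"))  # error-free codeword
print(split_output("110000100111101"))  # single-bit error
```

The first eleven output bits are the data word and the last four play the role of the remainder; they are all zero exactly when the generator polynomial divides the received word.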


In many data transmission applications, there is a need to make sure that all
burst errors of length 16 bits or less will be detected. Therefore, cyclic codes of the
type (16 + k, k) are used. The generating polynomial of degree 16 should be selected
so that the maximum number of data bits is sufficiently large, allowing the use of
the same code (and the same encoding and decoding circuits) for data blocks of
many different sizes. Two generating polynomials of degree 16 are commonly used
for this purpose. These are the CRC-16 polynomial (where CRC stands for Cyclic
Redundancy Check),

$$G(X) = (X + 1)(X^{15} + X + 1) = X^{16} + X^{15} + X^2 + 1$$

and the CRC-CCITT polynomial,

$$G(X) = (X + 1)(X^{15} + X^{14} + X^{13} + X^{12} + X^4 + X^3 + X^2 + X + 1) = X^{16} + X^{12} + X^5 + 1$$

In both cases, the degree-16 polynomial divides $X^n - 1$ for $n = 2^{15} - 1$ (but not
for any smaller value of n) and thus can be used for blocks of data of size up to
$2^{15} - 1 = 32{,}767$ bits. Note that shorter blocks can still use the same cyclic code.
Such blocks can be viewed as blocks of size 32,767 bits with a sufficient number of
leading 0s that can be ignored in the encoding or decoding operations. Also note
that both CRC polynomials have only four nonzero coefficients, greatly simplifying
the design of the encoding and decoding circuits.
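
Both factorizations can be confirmed by multiplying the factors modulo-2. A brief sketch, again assuming bit-mask polynomials; the constant names are ours.

```python
def gf2_mul(a: int, b: int) -> int:
    """Modulo-2 (carry-less) multiplication of two bit-mask polynomials."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        b >>= 1
    return product

X_PLUS_1 = 0b11
CRC16 = gf2_mul(X_PLUS_1, (1 << 15) | 0b11)            # (X + 1)(X^15 + X + 1)
CRC_CCITT = gf2_mul(X_PLUS_1, 0b1111000000011111)      # (X + 1)(X^15 + ... + X + 1)

for name, poly in (("CRC-16", CRC16), ("CRC-CCITT", CRC_CCITT)):
    exponents = [i for i in range(poly.bit_length() - 1, -1, -1) if (poly >> i) & 1]
    print(name, hex(poly), "nonzero terms at exponents", exponents)
```

Each product has exactly four nonzero coefficients, matching the expanded forms given above.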

The CRC-32 code shown below is widely used for data transfers over the Internet:

$$G(X) = X^{32} + X^{26} + X^{23} + X^{22} + X^{16} + X^{12} + X^{11} + X^{10} + X^8 + X^7 + X^5 + X^4 + X^2 + X + 1$$

allowing the detection of burst errors consisting of up to 32 bits for blocks of data
of size up to $n = 2^{32} - 1$ bits.

For data transmissions of long blocks, it is more efficient to employ a separable
encoding that will allow the received data to be used immediately without
having to wait for all the bits of the codeword to be received and decoded.
A separable cyclic code will allow performing the error detection independently
of the data processing itself. Fortunately, there is a simple way to
generate a separable (n,k) cyclic code. Instead of encoding the given data word
$D(X) = d_{k-1}X^{k-1} + d_{k-2}X^{k-2} + \cdots + d_0$ by multiplying it by the generator
polynomial G(X) of degree n - k, we first append (n - k) zeroes to D(X) and obtain
$\bar{D}(X) = d_{k-1}X^{n-1} + d_{k-2}X^{n-2} + \cdots + d_0 X^{n-k}$. We then divide $\bar{D}(X)$ by G(X),
yielding

$$\bar{D}(X) = Q(X)G(X) + R(X)$$

where R(X) is a polynomial of degree smaller than n - k. Finally, we form the
codeword $C(X) = \bar{D}(X) - R(X)$, which will be transmitted. This n-bit codeword has
G(X) as a factor, and consequently, if we divide C(X) by G(X), a nonzero remainder
will indicate that errors have occurred. In this encoding, $\bar{D}(X)$ and R(X) have no
terms in common, and thus, the first k bits in $C(X) = \bar{D}(X) - R(X) = \bar{D}(X) + R(X)$
are the original data bits while the remaining n - k are the check bits, making the
encoding separable.


■ EXAMPLE

We illustrate the procedure described above through the (5,4) cyclic code that
uses the same generator polynomial $X + 1$ as before. For the data word 0110 we
obtain the shifted data polynomial $\bar{D}(X) = X^3 + X^2$. Dividing $\bar{D}(X)$ by $X + 1$
yields $Q(X) = X^2$ and $R(X) = 0$.
Thus, the corresponding codeword is $X^3 + X^2$, or in binary 01100, where the
first four bits are the data bits and the last one the check bit. Similarly, for the
data word 1110, we obtain

$$\bar{D}(X) = X^4 + X^3 + X^2 = (X^3 + X + 1)(X + 1) + 1$$

yielding the codeword 11101. The reader can verify that the same 16 codewords
as before are generated, {00000, 00011, 00110, 01100, 11000, 10001, 00101,
01010, 10100, 01001, 10010, 01111, 11110, 11101, 11011, 10111}, but the
correspondence between the data words and the codewords has changed. ■
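
The separable encoding procedure (shift, divide, append the remainder) can be sketched as follows, assuming integer bit masks; `gf2_mod` and `encode_separable` are illustrative names.

```python
def gf2_mod(dividend: int, divisor: int) -> int:
    """Remainder of modulo-2 polynomial division."""
    width = divisor.bit_length()
    while dividend.bit_length() >= width:
        dividend ^= divisor << (dividend.bit_length() - width)
    return dividend

def encode_separable(data: int, generator: int) -> int:
    """Append n - k zeros to the data word, then add the remainder as check bits."""
    check_bits = generator.bit_length() - 1  # n - k, the degree of G(X)
    shifted = data << check_bits
    return shifted ^ gf2_mod(shifted, generator)  # subtraction = XOR in modulo-2

G = 0b11  # X + 1
for data in (0b0110, 0b1110):
    print(f"{data:04b} -> {encode_separable(data, G):05b}")

separable_codewords = {encode_separable(d, G) for d in range(16)}
```

The upper four bits of each codeword are the data word itself, so a receiver can start using the data while the check bit is still being verified.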


3.1.6 Arithmetic Codes 

Arithmetic error codes are those codes that are preserved under a set of arithmetic 
operations. This property allows us to detect errors which may occur during the 
execution of an arithmetic operation in the defined set. Such concurrent error de¬ 
tection can always be attained by duplicating the arithmetic unit, but duplication 
is often too costly to be practical. 

We say that a code is preserved under an arithmetic operation $\star$ if for any two
operands X and Y, and the corresponding encoded entities X' and Y', there is an
operation $\circledast$ for the encoded operands satisfying

$$X' \circledast Y' = (X \star Y)' \qquad (3.4)$$

This implies that the result of the arithmetic operation $\circledast$, when applied to the
encoded operands X' and Y', will yield the same result as encoding the outcome
of applying the original operation $\star$ to the original operands X and Y.
Consequently, the result of the arithmetic operation will be encoded in the same code
as the operands.

We expect arithmetic codes to be able to detect all single-bit faults. Note, however,
that a single-bit error in an operand or an intermediate result may well cause
a multiple-bit error in the final result. For example, when adding two binary
numbers, if stage i of the adder is faulty, all the remaining (n - i) higher order digits
may become erroneous.

There are two classes of arithmetic codes: separable and nonseparable. The simplest
nonseparable codes are the AN codes, formed by multiplying the operands
by a constant A. In other words, X' in Equation 3.4 is $A \cdot X$, and the operations
$\circledast$ and $\star$ are identical for addition and subtraction. For example, if A = 3, we multiply
each operand by 3 (obtained as 2X + X) and check the result of an add or subtract
operation to see whether it is an integer multiple of 3. All error magnitudes that
are multiples of A are undetectable. Therefore, we should not select a value of A
that is a power of the radix 2 (the base of the number system). An odd value of
A will detect every single digit fault, because such an error has a magnitude of $2^i$.
Setting A = 3 yields the least expensive AN code that still enables the detection of
all single errors.

For example, the number $0110_2 = 6_{10}$ is represented in the AN code with A = 3
by $010010_2 = 18_{10}$. A fault in bit position 3 may result in the erroneous number
$011010_2 = 26_{10}$. This error is easily detectable, since 26 is not a multiple of 3.
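
A minimal sketch of the AN code with A = 3, assuming unsigned integers; the function names are ours.

```python
A = 3

def an_encode(x: int) -> int:
    """AN code: the encoded operand is A * X (computed here as 2X + X)."""
    return (x << 1) + x

def an_is_valid(word: int) -> bool:
    """A word is a valid codeword only if it is an integer multiple of A."""
    return word % A == 0

encoded = an_encode(0b0110)
faulty = encoded ^ (1 << 3)  # fault in bit position 3
print(encoded, an_is_valid(encoded))
print(faulty, an_is_valid(faulty))

# The code is preserved under addition: A*X + A*Y = A*(X + Y).
print(an_is_valid(an_encode(6) + an_encode(5)))
```

Flipping any single bit changes the value by $\pm 2^i$, which is never a multiple of 3, so every single-bit fault in a codeword is caught by the divisibility check.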

The simplest separable codes are the residue code and the inverse residue code.
In each of these, we attach a separable check symbol C(X) to every operand X. For
the residue code, $C(X) = X \bmod A = |X|_A$, where A is called the check modulus.
For the inverse residue code, $C(X) = A - (X \bmod A)$. For both separable codes,
Equation 3.4 is replaced by

$$C(X) \circledast C(Y) = C(X \star Y) \qquad (3.5)$$

This equality clearly holds for addition and multiplication because the following
equations apply:

$$|X + Y|_A = \big|\,|X|_A + |Y|_A\,\big|_A$$

$$|X \cdot Y|_A = \big|\,|X|_A \cdot |Y|_A\,\big|_A \qquad (3.6)$$


■ EXAMPLE 

If A = 3, X = 7, and Y = 5, the corresponding residues are $|X|_A = 1$ and $|Y|_A = 2$.
When adding the two operands, we obtain $|7 + 5|_3 = 0 = \big||7|_3 + |5|_3\big|_3 = |1 + 2|_3 = 0$.
When multiplying the two operands, we get $|7 \cdot 5|_3 = 2 = \big||7|_3 \cdot |5|_3\big|_3 = |1 \cdot 2|_3 = 2$. ■


For division, the equation $X - S = Q \cdot D$ is satisfied, where X is the dividend, D
the divisor, Q the quotient, and S the remainder. The corresponding residue check
is therefore

$$\big|\,|X|_A - |S|_A\,\big|_A = \big|\,|Q|_A \cdot |D|_A\,\big|_A$$


■ EXAMPLE

If A = 3, X = 7, and D = 5, the results are Q = 1 and S = 2. The corresponding
residue check is $\big||7|_3 - |2|_3\big|_3 = \big||5|_3 \cdot |1|_3\big|_3 = 2$. The subtraction in the
left-hand-side term is done by adding the complement to the modulus 3, i.e.,
$|1 - 2|_3 = \big|1 + |3 - 2|_3\big|_3 = |1 + 1|_3 = 2$. ■
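
The residue relations for addition, multiplication, and division can be exercised with the numbers used in the examples. This is a sketch assuming nonnegative integer operands; the helper names are ours.

```python
def residue(x: int, A: int) -> int:
    """|x|_A; Python's % already returns a value in 0..A-1 for positive A."""
    return x % A

def add_check(x: int, y: int, A: int) -> bool:
    return residue(x + y, A) == residue(residue(x, A) + residue(y, A), A)

def mul_check(x: int, y: int, A: int) -> bool:
    return residue(x * y, A) == residue(residue(x, A) * residue(y, A), A)

def div_check(x: int, d: int, A: int):
    """Check | |X|_A - |S|_A |_A == | |Q|_A * |D|_A |_A; returns (holds, left-hand side)."""
    q, s = divmod(x, d)
    lhs = residue(residue(x, A) - residue(s, A), A)
    rhs = residue(residue(q, A) * residue(d, A), A)
    return lhs == rhs, lhs

print(add_check(7, 5, 3), mul_check(7, 5, 3), div_check(7, 5, 3))
```

The subtraction inside `div_check` relies on the modulo operator wrapping negative values back into the range 0 to A - 1, which has the same effect as adding the complement to the modulus.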


A residue code with A as a check modulus has the same undetectable error
magnitudes as the corresponding AN code. For example, if A = 3, only errors that
modify the result by some multiple of 3 will go undetected, and consequently,
single-bit errors are always detectable. In addition, the checking algorithms for
the AN code and the residue code are the same: in both we have to compute the
residue of the result modulo-A. Even the increase in word length, $\lceil \log_2 A \rceil$, is the
same for both codes. The most important difference is due to the property of
separability. The arithmetic unit for the check symbol C(X) in the residue code is
completely separate from the main unit operating on X, whereas only a single unit (of
a higher complexity) exists in the case of the AN code. An adder with a residue
code is depicted in Figure 3.14. In the error detection block shown in this figure,
the residue modulo-A of the X + Y input is calculated and compared to the result
of the mod A adder. A mismatch indicates an error.
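
The checking scheme can be sketched in software, assuming the fault is modeled as a bit flip in the output of the main adder; the `fault_mask` parameter is our own device for injecting such a fault and is not part of the circuit described in the text.

```python
def residue_checked_add(x: int, y: int, A: int = 3, fault_mask: int = 0):
    """Return (main sum, check sum, error flag) for an adder with a separate residue check."""
    main_sum = (x + y) ^ fault_mask      # main adder, optionally corrupted
    check_sum = ((x % A) + (y % A)) % A  # independent mod-A adder on the check symbols
    error = (main_sum % A) != check_sum  # error detection block
    return main_sum, check_sum, error

print(residue_checked_add(7, 5))                     # fault-free operation
print(residue_checked_add(7, 5, fault_mask=1 << 2))  # single-bit fault in the sum
```

Because the check symbols are processed by their own small adder, a fault in the main adder cannot corrupt the reference value it is compared against.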

The AN and residue codes with A = 3 are the simplest examples of a class of
arithmetic (separable and nonseparable) codes that use a value of A of the form
$A = 2^a - 1$, for some integer a. This choice simplifies the calculation of the
remainder when dividing by A (which is needed for the checking algorithm), and this is
why such codes are called low-cost arithmetic codes. The calculation of the
remainder when dividing by $2^a - 1$ is simple, because the equation

$$|z_i r^i|_{r-1} = |z_i|_{r-1}, \qquad r = 2^a \qquad (3.7)$$

allows the use of modulo-$(2^a - 1)$ summation of the groups of size a bits that
compose the number (each group has a value $0 \le z_i \le 2^a - 1$; see below).



FIGURE 3.14 An adder with a separate residue check (the output of the error
detection block is the error indication).

■ EXAMPLE 

To calculate the remainder when dividing the number X = 11110101011 by
$A = 7 = 2^3 - 1$, we partition X into groups of size 3, starting with the least
significant bit. This yields $X = (z_3, z_2, z_1, z_0) = (11, 110, 101, 011)$. We then add
these groups modulo-7; i.e., we "cast out" 7s and add the end-around-carry
whenever necessary. A carry-out has a weight of 8, and because $|8|_7 = 1$, we
must add an end-around-carry whenever there is a carry-out as illustrated
below.

          011    z_3
        + 110    z_2
        -----
        1 001
        +   1    end-around carry
        -----
          010
        + 101    z_1
        -----
          111
        + 011    z_0
        -----
        1 010
        +   1    end-around carry
        -----
          011

The residue modulo-7 of X is 3, which is the correct remainder of $X = 1963_{10}$
when divided by 7. ■
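
The group-wise summation with end-around carry translates directly into code. A minimal sketch for nonnegative integers; `low_cost_residue` is a name of our own, and the final step (treating the all-ones pattern as zero) is made explicit because $2^a - 1$ and 0 are the same residue.

```python
def low_cost_residue(x: int, a: int) -> int:
    """Residue of x modulo 2^a - 1, adding a-bit groups with end-around carry."""
    mask = (1 << a) - 1  # this is also the modulus A = 2^a - 1
    acc = 0
    while x:
        acc += x & mask              # add the next a-bit group z_i
        if acc > mask:               # a carry-out of weight 2^a occurred
            acc = (acc & mask) + 1   # discard it and add the end-around carry
        x >>= a
    return 0 if acc == mask else acc  # the all-ones pattern represents residue 0

print(low_cost_residue(0b11110101011, 3))
```

No division is needed: only a-bit additions and a carry test, which is what makes these check moduli inexpensive in hardware.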


Both separable and nonseparable codes are preserved when we perform arithmetic
operations on unsigned operands. If we wish to include signed operands as
well, we must require that the code be complementable with respect to R, where
R is either $2^n$ or $2^n - 1$ and n is the number of bits in the encoded operand. The
selected R will determine whether two's complement (for which $R = 2^n$) or one's
complement (for which $R = 2^n - 1$) arithmetic will be employed. For the AN code,
$R - AX$ must be divisible by A, and thus A must be a factor of R. If we insist on A
being odd, it excludes the choice $R = 2^n$, and only one's complement can be used.


■ EXAMPLE 

For n = 4, R is equal to $2^n - 1 = 15$ for one's complement and is divisible by
A for the AN code with A = 3. The number X = 0110 is represented by 3X =
010010, and its one's complement 101101 ($= 45_{10}$) is divisible by 3. However,
the two's complement of 3X is 101110 ($= 46_{10}$) and is not divisible by 3. If n = 5,
then for one's complement R is equal to 31, which is not divisible by A. The
number X = 00110 is represented by 3X = 0010010, and its one's complement
is 1101101 ($= 109_{10}$), which is not divisible by 3. ■


For the residue code with the check modulus A, the equation $A - |X|_A = |R - X|_A$
has to be satisfied. This implies that R must be an integer multiple of A, again
allowing only one's complement arithmetic to be used. However, we may modify
the procedure so that two's complement (with $R = 2^n$) can also be employed:

$$|2^n - X|_A = |2^n - 1 - X + 1|_A = |2^n - 1 - X|_A + |1|_A \qquad (3.8)$$

We therefore need to add a correction term $|1|_A$ to the residue code when forming
the two's complement. Note that A must still be a factor of $2^n - 1$.


■ EXAMPLE 

For the residue code with A = 7 and n = 6, $R = 2^6 = 64$ for two's complement
and $R - 1 = 63$ is divisible by 7. The number $001010_2 = 10_{10}$ has the residue
3 modulo-7. The two's complement of 001010 is 110110. The complement of
$|3|_7$ is $|4|_7$, and adding the correction term $|1|_7$ yields 5, which is the correct
residue modulo-7 of 110110 ($= 54_{10}$). ■
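
The correction term of Equation 3.8 can be sketched as follows, assuming that A divides $2^n - 1$ and that the n-bit pattern of the two's complement is read as an unsigned number when its residue is taken; the function name is ours.

```python
def residue_of_twos_complement(res_x: int, A: int) -> int:
    """Check symbol of the two's complement of X, computed from |X|_A alone."""
    complemented = (A - res_x) % A  # complement of the residue with respect to A
    return (complemented + 1) % A   # add the correction term |1|_A

n, A = 6, 7
x = 0b001010
twos_complement = (1 << n) - x      # bit pattern 110110
print(residue_of_twos_complement(x % A, A), twos_complement % A)
```

The check unit never sees the operand itself; it only complements the residue and adds the correction term.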


A similar correction is needed when we add operands represented in two's
complement and a carry-out (of weight $2^n$) is generated in the main adder. Such a
carry-out is discarded according to the rules of two's complement arithmetic. To
compensate for this, we need to subtract $|2^n|_A$ from the residue check. Since A is a
factor of $(2^n - 1)$, the term $|2^n|_A$ is equal to $|1|_A$.


■ EXAMPLE 

If we add to X = 110110 (in two's complement) the number Y = 001101, a
carry-out is generated and discarded. We must therefore subtract the correction
term $|2^6|_7 = |1|_7$ from the residue check with the modulus A = 7, obtaining

          110110 = X            101 = |X|_7
        + 001101 = Y          + 110 = |Y|_7
        --------              -----
        1 000011              1 011
                              +   1    end-around carry
                              -----
                                100
                              -   1    correction term
                              -----
                                011

where 3 is clearly the correct residue of the result 000011 modulo-7. ■
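
The carry-out correction can be sketched in the same way, assuming n-bit two's complement patterns held as unsigned integers and a modulus A that divides $2^n - 1$; `add_with_residue_check` is an illustrative name.

```python
def add_with_residue_check(x: int, y: int, n: int = 6, A: int = 7):
    """Add two n-bit two's complement patterns; return (result, corrected check symbol)."""
    total = x + y
    carry_out = total >> n                  # 1 if a carry of weight 2^n was produced
    result = total & ((1 << n) - 1)         # the carry-out is discarded
    check = ((x % A) + (y % A)) % A         # mod-A adder working on the residues
    check = (check - carry_out * pow(2, n, A)) % A  # subtract |2^n|_A when a carry occurred
    return result, check

result, check = add_with_residue_check(0b110110, 0b001101)
print(f"{result:06b}", check, result % 7)
```

Here the check unit needs one bit of information from the main adder (the carry-out), which is the interdependence between the two units that the correction introduces.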

The above modifications result in an interdependence between the main arithmetic
unit and the check unit that operates on the residues. Such an interdependence
may cause a situation in which an error from the main unit propagates to
the check unit and the effect of the fault is masked. However, it has been shown
that the occurrence of a single-bit error is always detectable.

Error correction can be achieved by using two or more residue checks. The simplest
case is the bi-residue code, which consists of two residue checks $A_1$ and $A_2$.
If n is the number of bits in the operand, select a and b such that n is the least
common multiple of a and b. If $A_1 = 2^a - 1$ and $A_2 = 2^b - 1$ are two low-cost residue
checks, then any single-bit error can be corrected.
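
To see why two residue checks suffice for correction, one can tabulate the pair of residues (the syndrome) produced by every single-bit error $\pm 2^i$. The sketch below uses a = 2 and b = 3 (so n = 6, $A_1 = 3$, $A_2 = 7$) purely as an illustration, and it assumes that the error is in the main result while both check symbols are error-free; the function names are ours.

```python
from math import gcd

def build_syndromes(a: int, b: int):
    """Map each syndrome (err mod A1, err mod A2) to the single-bit error +/- 2^i causing it."""
    n = a * b // gcd(a, b)  # operand length: least common multiple of a and b
    A1, A2 = (1 << a) - 1, (1 << b) - 1
    table = {}
    for i in range(n):
        for err in (1 << i, -(1 << i)):
            key = (err % A1, err % A2)
            if key in table:
                raise ValueError("two single-bit errors share a syndrome")
            table[key] = err
    return n, A1, A2, table

def correct(result: int, c1: int, c2: int, a: int, b: int) -> int:
    """Correct a single-bit error in result, given its error-free check symbols c1 and c2."""
    _, A1, A2, table = build_syndromes(a, b)
    syndrome = ((result - c1) % A1, (result - c2) % A2)
    if syndrome == (0, 0):
        return result
    return result - table[syndrome]

value = 0b101101
corrupted = value ^ (1 << 4)  # single-bit error in position 4
print(correct(corrupted, value % 3, value % 7, 2, 3) == value)
```

For this choice of a and b all twelve single-bit error syndromes are distinct, so the syndrome identifies both the position and the sign of the error.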

3.2 Resilient Disk Systems 

An excellent example of employing information redundancy through coding at a 
higher level than individual data words is the RAID structure. RAID stands for 
Redundant Arrays of Independent (or Inexpensive) Disks. We describe next five 
RAID structures. 

3.2.1 RAID Level 1 

RAID1 consists of mirrored disks. In place of one disk, there are two disks, each
being a copy of the other. If one disk fails, the other can continue to serve access
requests. If both disks are working, RAID1 can speed up read accesses by dividing
them among the two disks. Write accesses are, however, slowed down, because
both disks must finish the update before the operation can complete.

Let us assume that the disks fail independently, each at a constant rate $\lambda$, and
that the time to repair each is exponentially distributed with mean $1/\mu$. We will
now compute the reliability and availability of a RAID1 system.

To compute the reliability, we set up a three-state Markov chain as shown in
Figure 3.15 (Markov chains are explained in Chapter 2). The state of the system
is the number of disks that are functional: it can vary between 0 (failed system)
and 2 (both disks up). The unreliability at time t is the probability of being in the
failed state, $P_0(t)$. The differential equations associated with this Markov chain are
as follows:

$$\frac{dP_2(t)}{dt} = -2\lambda P_2(t) + \mu P_1(t)$$

$$\frac{dP_1(t)}{dt} = -(\lambda + \mu) P_1(t) + 2\lambda P_2(t)$$

$$P_0(t) = 1 - P_1(t) - P_2(t)$$

Solving these simultaneous differential equations with the initial conditions
$P_2(0) = 1$; $P_0(0) = P_1(0) = 0$, we can obtain the probability that the disk system
fails sometime before t. The expressions for the state probabilities are rather
complex and not very illuminating. We will make use of an approximation, whereby

FIGURE 3.15 Markov chain for RAID1 reliability calculation.

we compute the Mean Time to Data Loss (MTTDL), and then use the fact that
$\mu \gg \lambda$ (the repair rate is much greater than the failure rate).

The MTTDL is computed in the following way. State 0 will be entered if the
system enters state 1 and then makes a transition to state 0. If we start in state 2
at time 0, the mean time before state 1 is entered is $1/(2\lambda)$. The mean time spent
staying in state 1 is $1/(\lambda + \mu)$. Following this, the system can either go back to
state 2, which it does with probability $q = \mu/(\mu + \lambda)$, or to state 0, which it does
with probability $p = \lambda/(\mu + \lambda)$. The probability that n visits are made to state 1
before the system transits to state 0 is clearly $q^{n-1}p$, because we would have to
make n - 1 transitions from 1 to 2, followed by a transition from 1 to 0. The mean
time to enter state 0 in this case is given by

$$T_{2 \to 0}(n) = n\left(\frac{1}{2\lambda} + \frac{1}{\lambda + \mu}\right) = n\,\frac{3\lambda + \mu}{2\lambda(\lambda + \mu)}$$

Hence,

$$\mathrm{MTTDL} = \sum_{n=1}^{\infty} q^{n-1} p\, T_{2 \to 0}(n) = \sum_{n=1}^{\infty} n q^{n-1} p\, T_{2 \to 0}(1) = \frac{T_{2 \to 0}(1)}{p} = \frac{3\lambda + \mu}{2\lambda^2}$$

If $\mu \gg \lambda$, we can approximate the transition into state 0 by regarding the aggregate
of states 1 and 2 as a single state, from which there is a transition of rate 1/MTTDL
to state 0. Hence, the reliability can be approximated by the function

$$R(t) \approx e^{-t/\mathrm{MTTDL}} \qquad (3.9)$$

Figure 3.16 shows the unreliability of the system (probability of data loss) over 
time for a variety of mean disk lifetimes and mean disk repair times. It is worth 
noting the substantial impact of the mean repair time on the probability of data 
loss. 
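
The MTTDL expression and the approximation of Equation 3.9 are easy to evaluate numerically. The sketch below uses hypothetical figures (a mean disk lifetime of 500,000 hours and a mean repair time of 24 hours) that are our own assumptions, not values taken from Figure 3.16; the function names are likewise ours.

```python
import math

def raid1_mttdl(lam: float, mu: float) -> float:
    """Closed form: MTTDL = (3*lambda + mu) / (2*lambda^2)."""
    return (3 * lam + mu) / (2 * lam ** 2)

def raid1_mttdl_from_cycles(lam: float, mu: float) -> float:
    """Same quantity as (mean time per visit to state 1) / (probability the visit ends in state 0)."""
    time_per_cycle = 1 / (2 * lam) + 1 / (lam + mu)
    p = lam / (lam + mu)
    return time_per_cycle / p

def raid1_unreliability(t: float, lam: float, mu: float) -> float:
    """1 - R(t), with R(t) approximated by exp(-t / MTTDL)."""
    return 1 - math.exp(-t / raid1_mttdl(lam, mu))

lam = 1 / 500_000      # assumed failure rate per hour
mu = 1 / 24            # assumed repair rate per hour
hours_per_year = 8760
print(raid1_mttdl(lam, mu) / hours_per_year, "years")
print(raid1_unreliability(10 * hours_per_year, lam, mu))
```

Since $\mu \gg \lambda$, the MTTDL is close to $\mu/(2\lambda^2)$, so halving the mean repair time roughly doubles the MTTDL.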

FIGURE 3.16 Unreliability of RAID1 system, plotted against time (in years):
(a) impact of mean disk lifetime (curve labels indicate the mean lifetime of a single disk);
(b) impact of mean disk repair time (curve labels indicate mean repair times).

A calculation of the long-term availability of the disk system can be done based
on a Markov chain identical to that shown in Figure 2.16, yielding

$$A = \frac{\mu(\mu + 2\lambda)}{(\lambda + \mu)^2}$$
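
As a quick numerical illustration, the availability expression can be evaluated directly. The rates below are hypothetical values of our own choosing; the comparison uses the algebraic identity $\mu(\mu + 2\lambda)/(\lambda + \mu)^2 = 1 - (\lambda/(\lambda + \mu))^2$.

```python
def raid1_availability(lam: float, mu: float) -> float:
    """Long-term availability A = mu * (mu + 2*lambda) / (lambda + mu)^2."""
    return mu * (mu + 2 * lam) / (lam + mu) ** 2

lam = 1 / 500_000  # assumed failure rate per hour
mu = 1 / 24        # assumed repair rate per hour
unavailability = 1 - raid1_availability(lam, mu)
print(unavailability, (lam / (lam + mu)) ** 2)
```

The two printed values agree up to rounding, showing that the system is unavailable only for the fraction of time in which both disks are down at once.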


3.2.2 RAID Level 2 

Level 2 RAID consists of a bank of data disks in parallel with Hamming-coded
disks. Suppose there are d data disks and c code disks. Then, we can think of the
ith bit of each disk as bits of a (c + d)-bit word. Based on the theory of Hamming
codes, we know that we must have $2^c \ge c + d + 1$ in order to permit the correction
of one bit per word.
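
The smallest number of code disks satisfying this inequality can be found by a simple search; `hamming_check_disks` is an illustrative name.

```python
def hamming_check_disks(d: int) -> int:
    """Smallest c with 2^c >= c + d + 1 for d data disks."""
    c = 1
    while 2 ** c < c + d + 1:
        c += 1
    return c

for d in (1, 4, 8, 11, 26):
    print(d, "data disks need", hamming_check_disks(d), "code disks")
```

For small arrays the overhead is large (three code disks for four data disks), which is why the single parity disk of the later RAID levels is attractive.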

We will not spend more time on RAID2 because other RAID designs impose 
much less overhead. 

3.2.3 RAID Level 3 

RAID3 is a modification of RAID2 and arises from the observation that each disk
has error-correction coding per sector. Hence, if a sector is bad, we can identify it
as such. RAID3 consists of a bank of d data disks together with one parity disk.
The data are bit-interleaved across the data disks, and the ith position of the parity
disk contains the parity bit associated with the bits in the ith position of each of
the data disks. An example of a five-disk RAID3 system is shown in Figure 3.17.

For error-detection and error-correction purposes, we can regard the ith bit of
each disk as forming a (d + 1)-bit word, consisting of d data bits and 1 parity bit.
Suppose one such word has an incorrect bit in the jth bit position. The error-correcting
code for that sector in the jth disk will indicate a failure, thus locating the fault.
Once we have located the fault, the remaining bits can be used to restore the faulty
bit.

For example, let the word be 01101, where 0110 are the data bits and 1 is the parity
bit. If even parity is being used, we know that a bit is in error. If the fourth disk
(disk 3 in the figure) indicates an error in the relevant sector and the other disks
show no such errors, we know that the word should be 01111, and the correction
can be made appropriately.
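
Restoring the bit of a disk that has flagged an error is a single XOR over the remaining bits. A minimal sketch, assuming even parity and a word given as a list of bits ordered disk 0 first; `reconstruct_bit` is our own name.

```python
def reconstruct_bit(word_bits, failed_index: int) -> int:
    """With even parity, the missing bit equals the XOR of all the other bits."""
    value = 0
    for index, bit in enumerate(word_bits):
        if index != failed_index:
            value ^= bit
    return value

word = [0, 1, 1, 0, 1]  # data bits 0110 followed by the parity bit 1
failed = 3              # the fourth disk (disk 3) reports a bad sector
word[failed] = reconstruct_bit(word, failed)
print("".join(str(b) for b in word))
```

The same routine restores a lost parity bit, since the parity disk is just another position in the word.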

FIGURE 3.17 A RAID3 system with 4 data disks and an even-parity disk.

The Markov chains for the reliability and availability of this system are almost
identical to those used in RAID1. In RAID1, we had two disks per group; here, we
have d + 1. In both cases, the system fails (we have data loss) if two or more disks
fail. Hence, the Markov chain for computing reliability is as shown in Figure 3.18.
The analysis of this chain is similar to that of RAID1: the mean time to data loss
for this group is

$$\mathrm{MTTDL} = \frac{(2d + 1)\lambda + \mu}{d(d + 1)\lambda^2} \qquad (3.10)$$

and the reliability is given approximately by

$$R(t) \approx e^{-t/\mathrm{MTTDL}} \qquad (3.11)$$

FIGURE 3.18 Markov chain for RAID3 reliability calculation.

FIGURE 3.19 Unreliability of RAID3 system, plotted against time (in years), for a
mean disk lifetime of 500,000 hours.

Figure 3.19 shows some numerical results for various values of d. The case d = 1
is identical to the RAID1 system. The reliability drops as d increases, as is to be
expected.
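
Equation 3.10 can be evaluated for several group sizes to see how quickly the MTTDL shrinks. The mean disk lifetime of 500,000 hours matches the label of Figure 3.19; the mean repair time of 24 hours is a hypothetical value of our own, as is the function name.

```python
def raid_group_mttdl(d: int, lam: float, mu: float) -> float:
    """MTTDL of a group of d data disks plus one parity disk (Equation 3.10)."""
    return ((2 * d + 1) * lam + mu) / (d * (d + 1) * lam ** 2)

lam = 1 / 500_000  # failure rate per hour
mu = 1 / 24        # assumed repair rate per hour
for d in (1, 2, 4, 8):
    years = raid_group_mttdl(d, lam, mu) / 8760
    print(f"d = {d}: MTTDL is about {years:.3g} years")
```

For $\mu \gg \lambda$ the MTTDL falls roughly as $1/(d(d + 1))$, so a group with four data disks has about one tenth the MTTDL of a mirrored pair.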


3.2.4 RAID Level 4 

RAID4 is similar to RAID3, except that the unit of interleaving is not a single bit
but a block of arbitrary size, called a stripe. An example of a RAID4 system with
four data disks and a parity disk is shown in Figure 3.20. The advantage of RAID4
over RAID3 is that a small read operation may be contained in just a single data
disk, rather than interleaved over all of them. As a result, small read operations
are faster in RAID4 than in RAID3. A similar remark applies to small write
operations: in such an operation, both the affected data disk and the parity disk must
be updated. The updating of the parity is quite simple: the parity bit toggles if
the corresponding data bit that is being written is different from the one being
overwritten.

FIGURE 3.20 A RAID4 system with four data disks and a parity disk (each rectangle in
the figure contains a block (stripe) of data).
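
The parity update rule for a small write amounts to XORing the old parity with the old and the new data. A sketch assuming blocks are `bytes` objects of equal length; both function names are ours.

```python
def xor_blocks(blocks):
    """Parity block: bytewise XOR of all the given blocks."""
    parity = bytes(len(blocks[0]))
    for block in blocks:
        parity = bytes(p ^ b for p, b in zip(parity, block))
    return parity

def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """A parity bit toggles exactly where the new data differ from the old data."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

blocks = [bytes([1, 2, 3]), bytes([4, 5, 6]), bytes([7, 8, 9]), bytes([10, 11, 12])]
parity = xor_blocks(blocks)

new_block = bytes([0xFF, 0x00, 0x55])
parity = update_parity(parity, blocks[2], new_block)  # touches one data disk and the parity disk
blocks[2] = new_block
print(parity == xor_blocks(blocks))
```

Only the affected data disk and the parity disk are read and written; the other data disks are not involved in the update.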

The reliability model for RAID4 is identical to that of RAID3. 


3.2.5 RAID Level 5 

This is a modification of the RAID4 structure and arises from the observation that 
the parity disk can sometimes be the system bottleneck: in RAID4, the parity disk 
is accessed in each write operation. To get around this problem, we can simply 
interleave the parity blocks among the disks. In other words, we no longer have a 
disk dedicated to carrying parity bits. Every disk has some data blocks and some 
parity blocks. An example of a five-disk RAID5 system is shown in Figure 3.21. 

The reliability model for RAID5 is obviously the same as for RAID4: it is only 
the performance model that is different. 
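
One way to interleave the parity blocks is to rotate the parity position from stripe to stripe. The rotation rule below is a hypothetical choice made for illustration; it is not necessarily the placement drawn in Figure 3.21, and the function names are ours.

```python
def parity_disk(stripe: int, num_disks: int) -> int:
    """Assumed rotation: the parity block moves one disk to the left on each successive stripe."""
    return (num_disks - 1 - stripe) % num_disks

def stripe_layout(stripe: int, num_disks: int):
    """Return 'P' for the parity block and 'D' for data blocks, one entry per disk."""
    p = parity_disk(stripe, num_disks)
    return ["P" if disk == p else "D" for disk in range(num_disks)]

for stripe in range(5):
    print(stripe, " ".join(stripe_layout(stripe, 5)))
```

Over any five consecutive stripes each disk holds the parity block exactly once, so parity updates are spread evenly instead of loading a single disk.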

3.2.6 Modeling Correlated Failures 

In the analysis we have presented so far, we have assumed that the disks fail inde¬ 
pendently of one another. In this section, we will consider the impact of correlated 
failures. 

Correlated failures arise because power supply and control are typically shared 
among multiple disks. Disk systems are usually made up of strings. Each string 
consists of disks that are housed in one enclosure, and they share power supply, 
cabling, cooling, and a controller. If any of these items fails, the entire string can 
fail. 
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FIGURE 3.21 Distributed parity blocks in a five-disk RAID5 system. 

Let $\lambda_{str}$ be the rate of failure of the support elements (power, cabling, cooling,
control) of a string. If a RAID group is controlled by a single string, then the
aggregate failure rate of the group is given by

$$\lambda_{total} = \lambda_{indep} + \lambda_{str} \qquad (3.12)$$

where $\lambda_{indep}$ is approximately the inverse of the MTTDL, assuming independent
disk failures. If the disk repair rate is much greater than the disk failure rate, data
loss due to independent disk failures can be well modeled by a Poisson process.
The sum of two independent Poisson processes is itself a Poisson process: we can
therefore regard the aggregate failure process as Poisson with rate $\lambda_{total}$. The
reliability is therefore given by

$$R_{total}(t) = e^{-\lambda_{total} t} \qquad (3.13)$$

The dramatic impact of string failures in a RAID1 system is shown in Figure 3.22. 
(The impact for RAID3 and higher levels is similar). Figures of 150,000 hours for 
the mean string lifetime have been quoted in the literature, and at least one manu¬ 
facturer claims mean disk lifetimes of 1,000,000 hours. Grouping together an entire 
RAID array as a single string therefore increases the unreliability by several orders 
of magnitude. 
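
The size of this effect can be checked with the quoted lifetimes. The sketch below takes the mean string lifetime of 150,000 hours and the mean disk lifetime of 1,000,000 hours from the text, and assumes a mean repair time of 24 hours, which is our own hypothetical figure; the function names are also ours.

```python
import math

def raid1_mttdl(lam: float, mu: float) -> float:
    """MTTDL of a mirrored pair with independent disk failures."""
    return (3 * lam + mu) / (2 * lam ** 2)

def total_failure_rate(lam_indep: float, lam_str: float) -> float:
    """Equation 3.12: aggregate rate of a RAID group housed in a single string."""
    return lam_indep + lam_str

def unreliability(t: float, rate: float) -> float:
    """1 - exp(-rate * t), from Equation 3.13."""
    return -math.expm1(-rate * t)

lam_disk = 1 / 1_000_000  # per hour
lam_str = 1 / 150_000     # per hour
mu = 1 / 24               # assumed repair rate per hour

lam_indep = 1 / raid1_mttdl(lam_disk, mu)
lam_total = total_failure_rate(lam_indep, lam_str)
ten_years = 10 * 8760
print(unreliability(ten_years, lam_indep), unreliability(ten_years, lam_total))
```

With these numbers the string failure rate dominates the aggregate rate almost entirely, so the mirrored pair is only about as reliable as its shared enclosure.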

To get around this, one can have an orthogonal arrangement of strings and 
RAID groups, as depicted in Figure 3.23. In such a case, the failure of a string 
affects only one disk in each RAID group. Because each RAID can tolerate the 
failure of up to one disk, this reduces the impact of string failures. 

The orthogonal system can be modeled approximately as follows. Every data 
loss is caused by a sequence of events. If this sequence started with a single disk 
failure or by a string failure, we say the failure is triggered by an individual or 
string failure, respectively. 

Since both string and disk failure rates are very low, we can, without significant
error, model separately failures triggered by individual and string failures.
We will find the (approximate) failure rate due to each. Adding these two failure
rates will give us the approximate overall failure rate, which can then be
used to determine the MTTDL and the probability of data loss over any given
time.

FIGURE 3.22 Impact of string failure rate on RAID1 system, plotted against time
(in years); curve labels indicate mean string lifetime.

FIGURE 3.23 Orthogonal arrangement of strings and RAID groups (d = 4).

We next construct an approximate model that computes MTTDL and the reliability
of the system at any time t. This model allows any general distribution for
the repair times.

There is a total of d + 1 disks per RAID group, which in the orthogonal arrangement
means d + 1 strings, and g groups of disks. The total number of disks is
therefore (d + 1)g. Unlike our previous derivation, we will no longer assume that
repair times are exponentially distributed: all we ask is that their distributions be
known. Let $f_{\text{disk}}(t)$ denote the density function of the disk repair time.

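The approximate rate at which individual failures trigger data loss in a given
disk is given by $\lambda_{\text{disk}} \pi_{\text{indiv}}$, where $\lambda_{\text{disk}}$ is the failure rate of a single disk and
$\pi_{\text{indiv}}$ is the probability that a given individual failure triggers data loss. To calculate
$\pi_{\text{indiv}}$, recall that it is the probability that another disk fails in the affected RAID
group while the previous failure has not yet been repaired. But this failure happens
at the rate $d(\lambda_{\text{disk}} + \lambda_{\text{str}})$, since the second disk failure can happen either due to an
individual disk, or string, failure. Let $\tau$ denote the (random) disk repair time. The
probability of data loss conditioned on the event that the repair of the first disk
takes time $\tau$ is

$$\text{Prob}\{\text{Data loss} \mid \text{the repair takes } \tau\} = 1 - e^{-d(\lambda_{\text{disk}} + \lambda_{\text{str}})\tau}$$

$\tau$ has the density function $f_{\text{disk}}(\cdot)$; hence, the unconditional probability of data loss
is

$$
\begin{aligned}
\pi_{\text{indiv}} &= \int_0^\infty \text{Prob}\{\text{Data loss} \mid \text{the repair takes } \tau\}\, f_{\text{disk}}(\tau)\, d\tau \\
&= \int_0^\infty \left(1 - e^{-d(\lambda_{\text{disk}} + \lambda_{\text{str}})\tau}\right) f_{\text{disk}}(\tau)\, d\tau \\
&= \int_0^\infty f_{\text{disk}}(\tau)\, d\tau - \int_0^\infty e^{-d(\lambda_{\text{disk}} + \lambda_{\text{str}})\tau} f_{\text{disk}}(\tau)\, d\tau \\
&= 1 - F^*_{\text{disk}}(d[\lambda_{\text{disk}} + \lambda_{\text{str}}]) \qquad (3.14)
\end{aligned}
$$

where $F^*_{\text{disk}}(\cdot)$ is the Laplace transform of $f_{\text{disk}}(\cdot)$. Since there are (d + 1)g data disks
in all, the approximate rate at which data loss is triggered by individual disk
failure is given by

$$\Lambda_{\text{indiv}} \approx (d + 1)\, g\, \lambda_{\text{disk}} \left\{1 - F^*_{\text{disk}}(d[\lambda_{\text{disk}} + \lambda_{\text{str}}])\right\} \qquad (3.15)$$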

Why is this approximate and not exact? Because $(d + 1)g\lambda_{\text{disk}}$ is the rate at which
individual disk failures occur in a fault-free system. Since the probability is very
high that the system is entirely fault-free (if the repair times are much smaller
than the time between failures and the size of the system is not excessively large),
this is usually a good approximation. It does have the merit of not imposing any
limitations on the distribution of the repair time.

Let us now turn to computing $\Lambda_{\text{str}}$, the rate at which data loss is triggered by
a string failure. The total rate at which strings fail (if all strings are up) is
$(d + 1)\lambda_{\text{str}}$. When a string fails, we have to repair the string itself and then make any
necessary repairs to the individual disks, which may have been affected by the
string failure. We will make the pessimistic approximation that failure can happen
if another failure occurs in any of the groups or any of the disks before all of the
groups are fully restored. This is pessimistic because there are instances that are
counted as causing data loss that, in fact, do not do so. For example, we will count
as causing a data failure the occurrence of two string failures in the same string, the
second occurring before the first has been repaired. We can also make the optimistic
assumption that the disks affected by the triggering string failure are all immune
to a further failure before the string and all its affected disks are fully restored. The
difference between the failure rates predicted by these two assumptions will give
us an idea of how tight the pessimistic bound happens to be.

Let $\tau$ be the (random) time taken to repair the failed string and all of its
constituent disks that may have been affected by it. Let $f_{\text{str}}(\cdot)$ be the probability density
function of this time. Then, under the pessimistic assumption, additional failures
occur at the rate $\lambda_{\text{pess}} = (d + 1)\lambda_{\text{str}} + (d + 1)g\lambda_{\text{disk}}$. Under the optimistic assumption,
additional failures occur at the rate $\lambda_{\text{opt}} = d\lambda_{\text{str}} + dg\lambda_{\text{disk}}$.

A data loss will therefore be triggered in the pessimistic model with the
conditional (upon $\tau$) probability $p_{\text{pess}} = 1 - e^{-\lambda_{\text{pess}}\tau}$ and in the optimistic model with
the conditional (upon $\tau$) probability $p_{\text{opt}} = 1 - e^{-\lambda_{\text{opt}}\tau}$. Integrating on $\tau$, we obtain
the unconditional pessimistic and optimistic estimates: $\pi_{\text{pess}} = 1 - F^*_{\text{str}}(\lambda_{\text{pess}})$ and
$\pi_{\text{opt}} = 1 - F^*_{\text{str}}(\lambda_{\text{opt}})$, respectively, where $F^*_{\text{str}}(\cdot)$ is the Laplace transform of $f_{\text{str}}(\cdot)$.
The pessimistic and optimistic rates at which a string failure triggers data loss are
therefore given by

$$
\begin{aligned}
\Lambda_{\text{str\_pess}} &= (d + 1)\lambda_{\text{str}}\, \pi_{\text{pess}} \\
\Lambda_{\text{str\_opt}} &= (d + 1)\lambda_{\text{str}}\, \pi_{\text{opt}} \qquad (3.16)
\end{aligned}
$$

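The rate at which data loss happens in the system is therefore approximately:

$$
\Lambda_{\text{data\_loss}} \approx
\begin{cases}
\Lambda_{\text{indiv}} + \Lambda_{\text{str\_pess}} & \text{under the pessimistic assumption} \\
\Lambda_{\text{indiv}} + \Lambda_{\text{str\_opt}} & \text{under the optimistic assumption}
\end{cases}
\qquad (3.17)
$$

From this, we immediately have that

$$\text{MTTDL} \approx \frac{1}{\Lambda_{\text{data\_loss}}}, \qquad R(t) \approx e^{-\Lambda_{\text{data\_loss}}\, t} \qquad (3.18)$$

as approximations of the MTTDL and reliability of the system, respectively.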
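
The chain of formulas (3.14) through (3.18) can be evaluated numerically once a
repair-time distribution is chosen. The following is a minimal sketch, not part of
the original derivation: it assumes exponentially distributed repair times (whose
density with mean 1/mu has Laplace transform mu/(s + mu)), and the function
names and the mean repair times used in the example call are hypothetical.

```python
import math


def laplace_exponential(mean_repair_hours):
    """Return F*(s) for an exponential repair-time density (an assumption)."""
    mu = 1.0 / mean_repair_hours
    return lambda s: mu / (s + mu)


def data_loss_rates(d, g, lam_disk, lam_str, f_disk_lt, f_str_lt):
    """Return (pessimistic, optimistic) data-loss rates, Equations 3.14-3.17."""
    # Equation 3.14: probability that an individual failure triggers data loss.
    pi_indiv = 1.0 - f_disk_lt(d * (lam_disk + lam_str))
    # Equation 3.15: rate of data loss triggered by individual disk failures.
    rate_indiv = (d + 1) * g * lam_disk * pi_indiv
    # Rates of additional failures while a string is being restored.
    lam_pess = (d + 1) * lam_str + (d + 1) * g * lam_disk
    lam_opt = d * lam_str + d * g * lam_disk
    # Equation 3.16: rates of data loss triggered by string failures.
    rate_str_pess = (d + 1) * lam_str * (1.0 - f_str_lt(lam_pess))
    rate_str_opt = (d + 1) * lam_str * (1.0 - f_str_lt(lam_opt))
    # Equation 3.17.
    return rate_indiv + rate_str_pess, rate_indiv + rate_str_opt


def mttdl(rate):
    """Equation 3.18: approximate mean time to data loss."""
    return 1.0 / rate


def reliability(rate, t):
    """Equation 3.18: approximate reliability at time t."""
    return math.exp(-rate * t)


if __name__ == "__main__":
    # Lifetimes of 1,000,000 h (disk) and 150,000 h (string) are the figures
    # quoted in the text; the 24 h and 72 h mean repair times are hypothetical.
    pess, opt = data_loss_rates(
        d=4, g=10, lam_disk=1 / 1_000_000, lam_str=1 / 150_000,
        f_disk_lt=laplace_exponential(24.0), f_str_lt=laplace_exponential(72.0))
    print("MTTDL bounds (hours):", mttdl(pess), mttdl(opt))
```

Any other repair-time distribution can be substituted by passing its Laplace
transform as a callable, which mirrors the fact that the model places no
restriction on that distribution.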

3.3 Data Replication 

Data replication in distributed systems is another example of how information
redundancy can be used for improved fault tolerance at the system level. Data
replication consists of holding identical copies of data on two or more nodes in a
distributed system. As with a RAID system, a suitably managed data replication
scheme can offer both fault tolerance and improved performance (because one can,
for example, read data from nearby copies). However, it is important that the data
replicates be kept consistent, despite failures in the system.

FIGURE 3.24 Disconnection endangers the correct operation of data replication.

Consider, for example, a situation in which we keep five copies of the data: one 
copy on each of the five nodes of a distributed system, connected as shown in 
Figure 3.24a. Suppose that a read or a write request may arrive to any of the five 
nodes on the bidirectional links in the figure. As long as all five copies are kept 
consistent, a read operation can be sent to any of the nodes. However, suppose 
two of the links fail, as shown in Figure 3.24b. Then, node A is disconnected from 
nodes B and C. If a write operation updates the copy of the datum held in A, this 
write cannot be sent to the other nodes and they will no longer be consistent with 
A. Any read of their data will therefore result in stale data being used. 

In what follows we describe two approaches to managing the replication of 
data through the assignment of weights (votes) to the individual copies: a non- 
hierarchical scheme and a hierarchical one. Such votes allow us to prefer copies 
that reside on more reliable and better connected nodes. We will assume that all 
faulty nodes can be recognized as such: no malicious behavior takes place. 

3.3.1 Voting: Non-Hierarchical Organization 

We next present a voting approach to handling data replication. To avoid confusion,
we emphasize that we do not vote on multiple copies of data. If we read r
copies of some data structure, we select one with the latest timestamp. We assume
that data coding is used to detect/correct data errors in storage or transmission.
Voting is not used for this purpose but solely to specify minimum sets of nodes
that need to be updated for a write operation or that need to be accessed for a read
operation to be completed.

The simplest voting scheme is the following. Assign $v_i$ votes to copy i of that
datum and let S denote the set of all nodes with copies of the datum. Define v to
be the sum of all the votes, $v = \sum_{i \in S} v_i$. Define integers r and w with the following
properties:

$$r + w > v, \qquad w > v/2$$

Let V(X) denote the total number of votes assigned to copies in set X of nodes. The 
following strategy ensures that all reads are of the latest data. 

To complete a read, it is necessary to read from all nodes of a set $R \subseteq S$ such that
$V(R) \ge r$. Similarly, to complete a write, we must find a set $W \subseteq S$ such that
$V(W) \ge w$, and execute that write on every copy in W.

This procedure works because for any sets R and W such that $V(R) \ge r$ and
$V(W) \ge w$, we must have $R \cap W \ne \emptyset$ (because r + w > v). Hence, any read operation is
guaranteed to read the value of at least one copy which has been updated by the
latest write. Furthermore, for any two sets $W_1$, $W_2$ such that $V(W_1), V(W_2) \ge w$, we
must have $W_1 \cap W_2 \ne \emptyset$ (because w > v/2). This prevents different writes to the same
datum from being done concurrently and guarantees that there exists at least one
node that gets both updates.

Any set R such that $V(R) \ge r$ is said to be a read quorum, and any set W such that
$V(W) \ge w$ is called a write quorum.

How would this system work for the example shown in Figure 3.24? Assume
we give one vote to each node: the sum of all votes is thus v = 5. We must have
w > 5/2, so $w \in \{3, 4, 5\}$. Since r + w > v, we must have r > v - w. The following
combinations are permissible:

$(r, w) \in \{(1,5), (2,5), (3,5), (4,5), (5,5), (2,4), (3,4), (4,4), (5,4), (3,3), (4,3), (5,3)\}$
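
The two inequalities can be checked mechanically. The short sketch below (the
function name is hypothetical) enumerates every pair (r, w) with 1 <= r, w <= v that
satisfies r + w > v and w > v/2; for v = 5 this includes pairs such as (4,3) and
(5,3), which satisfy both inequalities but read more copies than the minimum
needed for w = 3.

```python
def permissible_pairs(v):
    """List all (r, w) with r + w > v and w > v/2, for 1 <= r, w <= v."""
    pairs = []
    for w in range(1, v + 1):
        if 2 * w <= v:          # write quorum must hold a strict majority
            continue
        for r in range(1, v + 1):
            if r + w > v:       # every read quorum must meet every write quorum
                pairs.append((r, w))
    return pairs


if __name__ == "__main__":
    print(permissible_pairs(5))
```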

Consider the case (r, w) = (1,5). A read operation can be successfully completed
by reading any one of the five copies; however, to complete a write, we have to
update every one of the five copies. This ensures that every read operation gets
the latest update of the data. If we pick w = 5, it makes no sense to set r > 1, which
would needlessly slow down the read operation. In this case, we can still continue
to read from each node even after the failures disconnect the network, as shown in
Figure 3.24b. However, it will be impossible to update the datum, since we cannot,
from any source, reach all five copies to update them.

As another example, consider (r, w) = (3,3). This setting has the advantage of
requiring just three copies to be written before a data update is successfully
completed. However, read operations now take longer because each overall read
operation requires reading three copies rather than one. With this system, after the
network disconnection, read or write operations coming into node A will not be
served. However, the four nodes that are left connected can continue to read and
write as usual.

The selected values of r and w will affect the performance of the system. If, for
instance, there are many more reads than writes, we may choose to keep r low
to speed up the read operations. However, selecting r = 1 requires setting w = 5,
which means that writes can no longer happen if even one node is disconnected.
Picking r = 2 allows w = 4: the writes can still be done if four out of the five nodes
are connected. We therefore have a tradeoff between performance and reliability.

The problem of assigning votes to nodes in such a way that availability is
maximized is very difficult (the system availability is the probability that both read
and write quorums are available). We therefore present two heuristics that usually
produce good results (although not necessarily optimal). These heuristics allow us
to use a general model that includes node and link failures. Assume that we know
the availability of each node i, $a_n(i)$, and of each link j, $a_l(j)$. Denote by L(i) the set
of links incident on node i.

Heuristic 1. Assign to node i a vote $v(i) = a_n(i) \sum_{j \in L(i)} a_l(j)$, rounded to the nearest
integer. If the sum of all votes assigned to nodes is even, give one extra vote to
one of the nodes with the largest number of votes.

Heuristic 2. Let k(i, j) be the node that is connected to node i by link j. Assign to
node i a vote $v(i) = a_n(i) + \sum_{j \in L(i)} a_l(j)\, a_n(k(i, j))$, rounded to the nearest integer.
As with Heuristic 1, if the sum of the votes is even, give one extra vote to one of
the nodes with the largest number of votes.

As an example, consider the system in Figure 3.25. The initial assignment due 
to Heuristic 1 is as follows: 

v(A) = round(0.7 x 0.7) = 0
v(B) = round(0.8 x 1.8) = 1
v(C) = round(0.9 x 1.6) = 1
v(D) = round(0.7 x 0.9) = 1

Note that Heuristic 1 gives node A 0 votes. This means that A and its links are so
unreliable compared to the rest that we may as well not use it. The votes add up
to 3, and so the read and write quorums must satisfy the requirements:

$$r + w > 3, \qquad w > 3/2$$

Consequently, $w \in \{2, 3\}$. If we set w = 2, we have r = 2 as the smallest read
quorum. The possible read quorums are therefore {BC, CD, BD}; these are also the
possible write quorums.

If we set w = 3, we have r = 1 as the smallest read quorum. The possible read
quorums are then {B, C, D}, and there is only one write quorum: BCD.

FIGURE 3.25 Vote assignment example (numbers indicate availabilities).


Under Heuristic 2, we have the following vote assignment: 

v(A) = round(0.7 + 0.7 x 0.8) = 1 

v(B) = round(0.8 + 0.7 x 0.7 + 0.9 x 0.9 + 0.2 x 0.7) = 2 

v(C) = round(0.9 + 0.9 x 0.8 + 0.7 x 0.7) = 2 

v(D) = round(0.7 + 0.2 x 0.8 + 0.7 x 0.9) = 1 

Since the votes add up to an even number, we give B an extra vote, so that the
final vote assignment becomes: v(A) = 1, v(B) = 3, v(C) = 2, v(D) = 1. The votes
now add up to 7, so that the read and write quorums must satisfy

$$r + w > 7, \qquad w > 7/2$$

Consequently, $w \in \{4, 5, 6, 7\}$. Table 3-7 shows read and write quorums associated
with r + w = 8. We invite the reader to augment the table by listing the availability
associated with each given (r, w) pair: this is, of course, the probability that at least
one read and one write quorum can be mustered despite node and/or link failures.
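
Both heuristics are easy to reproduce in code. The sketch below is illustrative
only: the node and link availabilities are read off the arithmetic shown for
Figure 3.25 (links A-B 0.7, B-C 0.9, B-D 0.2, C-D 0.7), rounding is taken to be
round-half-up, and breaking ties for the extra vote in favor of the first node in
alphabetical order is an assumption (the text only asks for "one of the nodes with
the largest number of votes").

```python
import math

# Availabilities inferred from the worked arithmetic for Figure 3.25.
NODE_AVAIL = {"A": 0.7, "B": 0.8, "C": 0.9, "D": 0.7}
LINK_AVAIL = {("A", "B"): 0.7, ("B", "C"): 0.9, ("B", "D"): 0.2, ("C", "D"): 0.7}


def incident_links(node):
    """Yield (neighbor, link availability) for every link touching `node`."""
    for (u, v), avail in LINK_AVAIL.items():
        if node == u:
            yield v, avail
        elif node == v:
            yield u, avail


def make_sum_odd(votes):
    """If the votes add up to an even number, give one extra vote to a top node."""
    if sum(votes.values()) % 2 == 0:
        top = max(sorted(votes), key=votes.get)  # first alphabetically on ties
        votes[top] += 1
    return votes


def heuristic1():
    """v(i) = a_n(i) * sum of availabilities of links incident on i."""
    votes = {}
    for node, a_n in NODE_AVAIL.items():
        raw = a_n * sum(a_l for _, a_l in incident_links(node))
        votes[node] = math.floor(raw + 0.5)
    return make_sum_odd(votes)


def heuristic2():
    """v(i) = a_n(i) + sum over links of a_l(j) * a_n(neighbor)."""
    votes = {}
    for node, a_n in NODE_AVAIL.items():
        raw = a_n + sum(a_l * NODE_AVAIL[nbr] for nbr, a_l in incident_links(node))
        votes[node] = math.floor(raw + 0.5)
    return make_sum_odd(votes)


if __name__ == "__main__":
    print(heuristic1())
    print(heuristic2())
```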

We illustrate the process by solving the problem for (r, w) = (4,4). The
availability in this case is the probability that at least one of the quorums AB, BC, BD,
ACD can be used. We compute this probability by first calculating the availabilities
of the individual quorums. Quorum AB can be used if A, B, and the single path
connecting them are up. The probability of this occurring is

$$\text{Prob}\{AB \text{ can be used}\} = a_n(A)\, a_n(B)\, a_l(l_{AB}) = 0.7 \cdot 0.8 \cdot 0.7 = 0.392$$

where $a_l(l_{AB})$ is the availability of the link $l_{AB}$ connecting the two nodes A and B.
Quorum BC will be usable if B, C and at least one of the two paths connecting
them are up. This probability can be calculated as follows:

$$
\begin{aligned}
\text{Prob}\{BC \text{ can be used}\} &= a_n(B)\, a_n(C) \left[a_l(l_{BC}) + a_l(l_{BD})\, a_n(D)\, a_l(l_{DC}) \left(1 - a_l(l_{BC})\right)\right] \\
&= 0.8 \cdot 0.9\, [0.9 + 0.2 \cdot 0.7 \cdot 0.7 \cdot 0.1] = 0.655
\end{aligned}
$$

TABLE 3-7 Read and write quorums under heuristic 2

r   w   Read quorums        Write quorums
4   4   AB, BC, BD, ACD     AB, BC, BD, ACD
3   5   B, AC, CD           BC, ABD
2   6   B, C, AD            ABC, BCD
1   7   A, B, C, D          ABCD

Similarly, we can calculate the availabilities of the quorums BD and ACD. However,
to compute the system availability, we cannot just add up the availabilities
of the individual quorums because the events "quorum i is up" are not mutually
exclusive. Instead, we would have to calculate the probabilities of all intersections
of these events and then substitute them in the inclusion and exclusion formula,
which is quite a tedious task. An easier and more methodical way of computing
the system availability is to list all possible combinations of system components'
states, and add up the probabilities of those combinations for which a quorum
exists. In our example, the system has eight components (nodes and links), each of
which can be in one of two states: "up" and "down," with $2^8 = 256$ system states
in all. The probability of each state is a product of eight terms, each taking one of
the following forms: $a_n(i)$, $(1 - a_n(i))$, $a_l(j)$, or $(1 - a_l(j))$. For each such state, we can
establish whether a read quorum or a write quorum exists, and the availability of
the system is the sum of the probabilities of the states in which both read and write
quorums exist.
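
The state enumeration just described can be sketched as follows. The
availabilities are again those implied by the arithmetic for Figure 3.25. Treating
a quorum as usable when all of its members are up and mutually reachable
through up nodes and up links is an assumption, chosen to agree with the path
calculation shown for quorum BC; the function names are hypothetical.

```python
from itertools import product

NODE_AVAIL = {"A": 0.7, "B": 0.8, "C": 0.9, "D": 0.7}
LINK_AVAIL = {("A", "B"): 0.7, ("B", "C"): 0.9, ("B", "D"): 0.2, ("C", "D"): 0.7}


def reachable_from(start, up_nodes, up_links):
    """Return the set of up nodes connected to `start` via up links."""
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for u, v in up_links:
            other = v if current == u else u if current == v else None
            if other is not None and other in up_nodes and other not in seen:
                seen.add(other)
                stack.append(other)
    return seen


def quorum_usable(quorum, up_nodes, up_links):
    """A quorum such as "ACD" is usable if its members are up and connected."""
    members = set(quorum)
    if not members <= up_nodes:
        return False
    return members <= reachable_from(next(iter(members)), up_nodes, up_links)


def availability(quorums):
    """Sum the probabilities of the 2^8 states in which some quorum is usable."""
    nodes, links = list(NODE_AVAIL), list(LINK_AVAIL)
    total = 0.0
    for state in product([True, False], repeat=len(nodes) + len(links)):
        node_up, link_up = state[:len(nodes)], state[len(nodes):]
        prob = 1.0
        for name, up in zip(nodes, node_up):
            prob *= NODE_AVAIL[name] if up else 1.0 - NODE_AVAIL[name]
        for link, up in zip(links, link_up):
            prob *= LINK_AVAIL[link] if up else 1.0 - LINK_AVAIL[link]
        up_nodes = {name for name, up in zip(nodes, node_up) if up}
        up_links = [link for link, up in zip(links, link_up) if up]
        if any(quorum_usable(q, up_nodes, up_links) for q in quorums):
            total += prob
    return total


if __name__ == "__main__":
    print("AB alone:", availability(["AB"]))
    print("(r, w) = (4, 4):", availability(["AB", "BC", "BD", "ACD"]))
```

Calling `availability` with a single quorum reproduces the individual quorum
availabilities computed by hand, which is a convenient sanity check before
evaluating the full quorum list.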

For (r, w) = (4,4), the lists of read quorums and write quorums are identical. For
any other value of (r, w), these lists are different, and to calculate the availability
of the system, we must take into consideration the relative frequencies of read and
write operations and multiply these by the probabilities that a read quorum and a
write quorum exists, respectively.

A write quorum must consist of more than half the total number of votes. A
system that is not easily or rapidly repaired, however, could degrade to the point at
which no connected cluster exists that can muster a majority of the total votes. In
such a case, no updates can be carried out to any data even if a sufficiently large
portion of the system remains operational.

This problem can be countered by dynamic vote assignment. Instead of keeping 
the read and write quorums static, we alter them to adjust to prevailing system 
conditions. In the discussion that follows, we assume that each node has exactly 
one vote. It is not difficult to relax this restriction. 

For each datum, the algorithm consists of maintaining version numbers, $VN_i$,
with each copy of that datum at each node i. Every time a node updates a datum,
the corresponding version number is incremented. Assume that an update arrives
at a node. This can only be executed if a write quorum can be gathered. The update
sites cardinality at node i, denoted by $SC_i$, is the number of nodes that participated
in the $VN_i$-th update of that datum. When the system starts operation, $SC_i$ is
initialized to the total number of nodes in the system. The algorithm in Figure 3.26
shows how the dynamic vote assignment procedure works.

1. If an update request arrives at node i, node i computes the following
   quantities:

   • $M = \max\{VN_j,\ j \in S_i\}$ (where $S_i$ is the set of nodes with which node i
     can communicate, including i itself), i.e., the maximum version number
     of the concerned datum, among all the nodes with which node i can
     communicate.

   • $I = \{j \mid VN_j = M,\ j \in S_i\}$, i.e., the set of all nodes whose version number
     is equal to the maximum.

   • $N = \max\{SC_j,\ j \in I\}$, i.e., the maximum update sites cardinality
     associated with all the nodes in I.

2. If $\|I\| > N/2$, then node i can raise a write quorum and is allowed to carry
   out the update on all nodes in I; otherwise the update is not allowed. The
   update is carried out and the version number of each copy of that datum in I
   is incremented, i.e., $VN_j$ is incremented for each $j \in I$. Also, for each $j \in I$, we
   set $SC_j = \|I\|$. This entire step must be done atomically: all these operations
   must be done at each node in I, or none of them can be done.

FIGURE 3.26 Algorithm for dynamic vote assignment ($\|I\|$ is the cardinality of set I).

The following example illustrates the algorithm. Suppose we start with seven
nodes, all carrying copies of some datum. The state at time $t_0$ is as follows:

      A  B  C  D  E  F  G
VN    5  5  5  5  5  5  5
SC    7  7  7  7  7  7  7

Suppose now that at time $t_0$ a failure occurs in the system, disconnecting the
system into two connected components: {A, B, C, D} and {E, F, G}. No element in one
component can communicate with any element in the other. Suppose E receives
an update request at time $t_1 > t_0$. Since $SC_E = 7$, E has to find more than 7/2 (i.e.,
four or more) nodes (including itself), to consummate that update. However, E can
only communicate with two other nodes, F and G, and so the update request must
be rejected.

At time $t_2 > t_0$, an update request arrives at node A, which is connected to three
other nodes, and so the request can be honored. The update is carried out on A, B,
C, and D, and the new state becomes the following:

      A  B  C  D  E  F  G
VN    6  6  6  6  5  5  5
SC    4  4  4  4  7  7  7

At time $t_3 > t_2$, there is a further failure: the connected components of the
network become {A, B, C}, {D}, {E, F, G}. At time $t_4 > t_3$, an update request arrives at
C. The write quorum at C consists of just three elements now (i.e., the smallest
number greater than $SC_C/2$), and so the update can be successfully completed at
nodes A, B, and C. The state is now:

      A  B  C  D  E  F  G
VN    7  7  7  6  5  5  5
SC    3  3  3  4  7  7  7
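
The example can be replayed with a few lines of code. This is a minimal sketch of
the decision rule in Figure 3.26 only: the function name is hypothetical, the
caller supplies the set of nodes reachable from the node that received the
request, and the atomicity requirement of step 2 is not modeled.

```python
def try_update(vn, sc, reachable):
    """Apply the dynamic vote assignment rule; return True if the update ran.

    vn, sc    -- dicts mapping node name to version number / update sites cardinality
    reachable -- set S_i of nodes the requesting node can reach (itself included)
    """
    max_version = max(vn[j] for j in reachable)                  # M
    latest = {j for j in reachable if vn[j] == max_version}      # I
    max_cardinality = max(sc[j] for j in latest)                 # N
    if len(latest) <= max_cardinality / 2:
        return False                                             # no write quorum
    for j in latest:
        vn[j] = max_version + 1
        sc[j] = len(latest)
    return True


if __name__ == "__main__":
    nodes = "ABCDEFG"
    vn = {n: 5 for n in nodes}
    sc = {n: 7 for n in nodes}
    print(try_update(vn, sc, {"E", "F", "G"}))        # request at E
    print(try_update(vn, sc, {"A", "B", "C", "D"}))   # request at A
    print(try_update(vn, sc, {"A", "B", "C"}))        # request at C
    print(vn, sc)
```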

What protocols must be followed to allow nodes to rejoin the components after 
having been disconnected from them? We leave their design to the reader. 

3.3.2 Voting: Hierarchical Organization 

The obvious question that now arises is whether there is a way to manage data
replication that does not require that r + w > v. If v is large (which can happen if a
large number of copies is to be maintained), then data operations can take a long
time. One solution is to have a hierarchical voting scheme as follows.

We construct an m-level tree in the following way. Let all the nodes holding
copies of the data be the leaves at level m - 1. We then add virtual nodes at the
higher levels up to the root at level 0. All the added nodes are only virtual
groupings of the real nodes. Each node at level i will have the same number of children,
denoted by $l_{i+1}$. As an example, consider Figure 3.27. In this tree, $l_1 = l_2 = 3$.















We now assign one vote to each node in the tree and define the read and write
quorum sizes, $r_i$ and $w_i$, respectively, at level i to satisfy the inequalities:

$$r_i + w_i > l_i, \qquad w_i > l_i/2$$

Then, the following algorithm is used to recursively assemble a read and write
quorum at the leaves of the tree. Read-mark the root at level 0. Then, at level 1,
read-mark $r_1$ nodes. When proceeding from level i to level i + 1, read-mark $r_{i+1}$
children of each of the nodes read-marked at level i. It is not allowed to read-mark
a node that does not have at least $r_{i+1}$ nonfaulty children: if this was done,
we need to backtrack and undo the marking of that node. Proceed like this until
i = m - 1. The leaves that have been read-marked form a read-quorum. Forming a
write-quorum is similar.

For the tree in Figure 3.27, let us select $w_i = 2$ for i = 1, 2, and set $r_i = l_i - w_i + 1 = 2$.
Starting at the root, read-mark two of its children, say X and Y. Now, read-mark
two children for X and Y, say A, B for X, and D, E for Y. The read quorum is the
set of read-marked leaves, namely, A, B, D, and E.

Suppose D had been faulty. Then, it cannot be part of the read-quorum, so we 
have to pick another child of Y, namely, F, to be in the read-quorum. If two of Y's 
children had been faulty, we cannot read-mark Y and have to backtrack and try 
read-marking Z instead. 
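
The marking procedure, including the backtracking step, can be sketched
recursively. The sketch below is illustrative: the tree mirrors the description
of Figure 3.27 (three children per node), but the leaf names G, H, I under Z and
the function name are assumptions, since the text names only some of the nodes.

```python
# Hypothetical encoding of the tree of Figure 3.27; leaves have no entry.
CHILDREN = {
    "root": ["X", "Y", "Z"],
    "X": ["A", "B", "C"],
    "Y": ["D", "E", "F"],
    "Z": ["G", "H", "I"],
}


def assemble_quorum(node, quota, faulty, level=0):
    """Return the marked leaves under `node`, or None if it cannot be marked.

    quota  -- dict mapping level i to r_i (or w_i for a write quorum)
    faulty -- set of faulty leaves
    """
    children = CHILDREN.get(node)
    if children is None:                      # leaf: usable only if nonfaulty
        return None if node in faulty else [node]
    needed = quota[level + 1]
    leaves, marked = [], 0
    for child in children:
        sub = assemble_quorum(child, quota, faulty, level + 1)
        if sub is None:                       # child cannot be marked: try next
            continue
        leaves.extend(sub)
        marked += 1
        if marked == needed:
            return leaves
    return None                               # too few markable children


if __name__ == "__main__":
    r = {1: 2, 2: 2}
    print(assemble_quorum("root", r, faulty=set()))
    print(assemble_quorum("root", r, faulty={"D"}))
    print(assemble_quorum("root", r, faulty={"D", "E"}))
```

Returning None from a subtree plays the role of undoing the mark on that node,
so the caller simply moves on to the next sibling.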

As an exercise, the reader should list read quorums generated by other values
for $r_i$ and $w_i$. For example, try $r_1 = 1$, $w_1 = 3$, $r_2 = 2$, $w_2 = 2$.

Note that the read quorum consists of just four copies. Similarly, we can generate
a write quorum with four copies. If we had tried the non-hierarchical approach
with one vote per node, our read and write quorums would have had to satisfy
the conditions r + w > 9; w > 9/2. Hence, the write quorum in the non-hierarchical
approach is of size at least 5, whereas that for the tree approach is 4.

Given each read and write quorum, the topology of the interconnection network,
and the probability of node and link failures, we can, for each assignment of
$r_i$ and $w_i$, list the probability that read and write quorums will exist in any given
system.

How can we prove that this approach does, in fact, work? We do so by showing 
that every possible read quorum has to intersect with every possible write quorum 
in at least one node. This is not difficult to do, and we leave it as an exercise to the 
reader. 

3.3.3 Primary-Backup Approach 

Another scheme for managing data replicas is the primary-backup approach, in
which one node is designated as the primary, and all accesses are through that
node. The other nodes are designated as backups. Under normal operation, all
writes to the primary are also copied to the functional backups. When the primary
fails, one of the backup nodes is chosen to take its place.

Let us now consider the details of this scheme. We start by describing how
things work in the absence of failures. All requests from users (clients in the
client-server terminology) are received by the primary server. It forwards the request to
the copies and waits until it receives an acknowledgment from all of them. Once
the acknowledgments are in, the primary fulfills the client's request.

All client requests must pass through the primary; it is the primary that
serializes them, determining the order in which they are served. All messages from the
primary are numbered, so that they can be processed by the backups in the order
in which they are sent. This is extremely important, because changing the order in
which requests are served could result in entering an incorrect state.


■ EXAMPLE 

The primary receives a request, $R_d$, to deposit $1000 in Mr. Smith's bank
account. This is followed by a request, $R_t$, to transfer $500 out of his account. He
had $300 in his bank balance to begin with.

Suppose the primary receives $R_d$ first and then $R_t$. It forwards these messages
in that order to each of the backups. Suppose backup $B_1$ receives $R_d$ first and
then $R_t$. $B_1$ can process them in that order, leaving $800 in Mr. Smith's account.
Now, suppose backup $B_2$ receives $R_t$ first and then $R_d$. $R_t$ cannot be honored:
Mr. Smith does not have enough money in his account. Hence, the transfer
request is denied in the copy of the account held by $B_2$. $B_1$ and $B_2$ are now no
longer consistent. ■
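
The role of message numbering in this example can be illustrated with a small
sketch. It is not a description of any particular system: the class and method
names are hypothetical, and the backup simply buffers messages that arrive early
and applies them strictly in the order of the sequence numbers assigned by the
primary.

```python
class Backup:
    """A backup copy that applies the primary's messages in sequence order."""

    def __init__(self, balance):
        self.balance = balance
        self.next_seq = 1          # sequence number expected next
        self.pending = {}          # early arrivals, keyed by sequence number

    def receive(self, seq, kind, amount):
        """Buffer the message, then apply every message that is now in order."""
        self.pending[seq] = (kind, amount)
        while self.next_seq in self.pending:
            queued_kind, queued_amount = self.pending.pop(self.next_seq)
            self._apply(queued_kind, queued_amount)
            self.next_seq += 1

    def _apply(self, kind, amount):
        if kind == "deposit":
            self.balance += amount
        elif kind == "transfer_out" and self.balance >= amount:
            self.balance -= amount
        # A transfer that would overdraw the account is denied.


if __name__ == "__main__":
    b1, b2 = Backup(300), Backup(300)
    # B1 receives the deposit (message 1) and then the transfer (message 2).
    b1.receive(1, "deposit", 1000)
    b1.receive(2, "transfer_out", 500)
    # B2 receives the same messages in the opposite order.
    b2.receive(2, "transfer_out", 500)
    b2.receive(1, "deposit", 1000)
    print(b1.balance, b2.balance)
```

Because the transfer is held back until the deposit has been applied, both
backups end in the same state regardless of arrival order.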


In the absence of failures, it is easy to see that all the copies will be consistent if we 
follow this procedure. We now need to augment it to consider the case of failure. 
We will limit ourselves here to fail-stop failures, which are failures that result in 
silence. Byzantine failures (in which nodes can send back lying messages and do 
arbitrary things to their copies of the data) are not covered. 

Start by considering network failures. If the network becomes disconnected as 
a result of these failures, then it is only the component that is reachable by the 
primary that can take part in this algorithm. All others will fall out of date and 
will need to be reinitialized when the network is repaired. 

Next, consider the loss of messages in the network. This can be handled by
using a suitable communication algorithm, which retransmits messages until an
acknowledgment is received. Hence, we can assume in what follows that if a
message is transmitted, it will ultimately be received, unless we have a node failure.

Now, let us turn to node failures. Suppose one of the backups has failed and
never returns an acknowledgment. The primary has to wait to receive an
acknowledgment from each of the backups; if a backup fails, it may have to wait forever.
This problem is easy to remedy: introduce a timeout feature. If the primary does
not receive an acknowledgment from the backup within a specified period, the
primary assumes the backup is faulty and proceeds to remove it from the group
of backups. Obviously, the value that is used for the timeout depends on the
interconnection network and the speed of the processing.

Next, consider the failure of the primary itself, and let us see how this affects the
processing of some request, R. How complicated it can be to handle this case
depends on when the primary goes down. If it fails before forwarding any copies of
R to any of its backups, there is no possibility of inconsistency among the backup
copies: all we have to do then is to designate one of the backups as the new
primary. This can be done by numbering the primary and each of the backups and
always choosing the smallest-numbered functional copy to play the part of the
primary.

If it fails after forwarding copies of R to all of its backups, then again there is no 
inconsistency among the backup copies: they have all seen identical copies of R. 
All that remains then is to choose one of the backups to be the new primary. 

The third case is the most complex: the primary fails after sending out messages
to some, but not all, of its backups. Such a situation obviously needs some
corrective action to maintain consistency among the backups. This is a little complicated
and requires us to introduce the concept of a group view. To begin with, when the
system starts with the primary and all backups fully functional and consistent, the
group view consists of all of these copies. Each element in this set is aware of the
group view; in other words, each backup knows the full set of backups to whom
the primary is forwarding copies of requests. Call this initial group view $G_0$. At
any point in time, there is a prevailing group view, which is modified as nodes fail
and are repaired (as described below).

Messages as received by the backups are classified by them as either stable or 
unstable. A stable message is one that has been acknowledged by all the backups in 
the current group view. Until an acknowledgment has been observed, the message 
is considered to be unstable. 

Suppose now that backup $B_i$ detects that the primary has failed. We will discuss
below how such failure might be detected. Then, $B_i$ sends out a message
announcing its findings to the other nodes in the current group view. A new group view is
then constructed, from which the primary node is excluded and a new primary is
designated.

Before each node can install the new group view, it transmits to the other nodes
in the old group view all the unstable messages in its buffer. This is followed by
an end-of-stream message, announcing that all of its unstable messages have been
sent. When it has received from every node in the new view an acknowledgment
of these messages, it can proceed to assume that the new group view is now
established.

What if another node fails when this is going on? This will result in a
waited-for acknowledgment never being received: a timeout can be used to declare as faulty
nodes that do not acknowledge messages, and the procedure of constructing yet
another group view can be repeated.

This leaves us with the question of how the failure of a primary is to be
discovered. There are many ways in which this can be done. For example, one may have
each node run diagnostics on other nodes. Alternatively, we could require that the
primary broadcast a message ("I am alive") at least once every T seconds, for some
suitable T. If this requirement is not fulfilled, that could be taken as indicating that
the primary is faulty.
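
A minimal sketch of the "I am alive" approach, combined with the rule of
choosing the smallest-numbered functional copy as the new primary, is given
below. The class and function names are hypothetical, timestamps are passed in
explicitly rather than read from a clock, and the group-view change protocol
itself is not modeled.

```python
class PrimaryMonitor:
    """Tracks "I am alive" messages from the primary at one backup."""

    def __init__(self, period_t, now):
        self.period_t = period_t      # the primary must be heard every T seconds
        self.last_heard = now

    def alive_message(self, now):
        """Record the arrival time of an "I am alive" broadcast."""
        self.last_heard = now

    def primary_suspected(self, now):
        """True if no broadcast has arrived within the last T seconds."""
        return now - self.last_heard > self.period_t


def choose_new_primary(functional_copies):
    """Pick the smallest-numbered functional copy as the new primary."""
    return min(functional_copies)


if __name__ == "__main__":
    monitor = PrimaryMonitor(period_t=2.0, now=0.0)
    monitor.alive_message(now=1.5)
    print(monitor.primary_suspected(now=3.0))   # heard 1.5 s ago
    print(monitor.primary_suspected(now=4.0))   # silent for 2.5 s
    print(choose_new_primary({3, 1, 2}))
```

In practice T would have to be chosen with the same considerations as the
acknowledgment timeout: the interconnection network and the processing speed.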

Finally, we should mention that this procedure allows for nodes to be repaired. 
Such a node would make its database consistent with that of the nodes in the 
prevailing group view and announce its accession to the group through a message. 
The nodes would then go through the procedure of changing the group view to 
accommodate this returning node. 


3.4 Algorithm-Based Fault Tolerance 

Algorithm-Based Fault Tolerance (ABFT) is an approach to provide fault detection 
and diagnosis through data redundancy. The data redundancy is not implemented 
at either the hardware or operating system level. Instead, it is implemented at 
the application software level and as a result, its exact implementation will differ 
from one class of applications to another. Implementing data redundancy is more 
efficient when applied to large arrays of data rather than to many independent 
scalars. Consequently, ABFT techniques have been developed for matrix-based 
and signal processing applications such as matrix multiplication, matrix inversion, 
LU decomposition and the Fast Fourier Transform. We will illustrate the ABFT 
approach through its application to basic matrix operations. 

Data redundancy in matrix operations is implemented using a checksum code. 
Given an $n \times m$ matrix $A$, we define the column checksum matrix $A_C$ as 

$$A_C = \begin{bmatrix} A \\ eA \end{bmatrix}$$

where $e = [1\ 1\ \cdots\ 1]$ is a row vector containing $n$ 1s. In other words, the elements in 
the last row of $A_C$ are the checksums of the corresponding columns of $A$. Similarly, 
we define the row checksum matrix $A_R$ as 

$$A_R = [A \quad Af]$$

where $f = [1\ 1\ \cdots\ 1]^T$ is a column vector containing $m$ 1s. Finally, the full 
$(n + 1) \times (m + 1)$ checksum matrix $A_F$ is defined as 

$$A_F = \begin{bmatrix} A & Af \\ eA & eAf \end{bmatrix}$$

Based on the discussion in Section 3.1, it should be clear that the column or row 
checksum matrix can be used to detect a single fault in any column or row of A, 
respectively, whereas the full checksum matrix can be used to locate an erroneous 
single element of A. If the computed checksums are accurate (overflows are not 
discarded), locating the erroneous element allows us to correct it as well. 
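
As a rough illustration of locating and correcting a single erroneous element with the full checksum matrix, consider the following sketch. It assumes exact integer arithmetic (no discarded overflows) and an error confined to one data element; the function names are hypothetical.

```python
def full_checksum(a):
    """Append a checksum column (row sums) and a checksum row (column sums)."""
    with_row_sums = [list(row) + [sum(row)] for row in a]
    column_sums = [sum(col) for col in zip(*with_row_sums)]
    return with_row_sums + [column_sums]


def locate_and_correct(af):
    """Find a single erroneous data element of a full checksum matrix.

    Corrects the element in place and returns its (row, column) position,
    or returns None if all checksums agree.
    """
    n, m = len(af) - 1, len(af[0]) - 1
    bad_rows = [i for i in range(n) if sum(af[i][:m]) != af[i][m]]
    bad_cols = [j for j in range(m)
                if sum(af[i][j] for i in range(n)) != af[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("not a single data-element error")
    i, j = bad_rows[0], bad_cols[0]
    # The row syndrome equals the amount by which the element is off.
    af[i][j] -= sum(af[i][:m]) - af[i][m]
    return (i, j)


af = full_checksum([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
af[1][2] = 60                      # inject a single error
position = locate_and_correct(af)  # the failing row and column intersect here
```

The failing row checksum and the failing column checksum intersect at the erroneous element, and either syndrome gives the size of the error.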

The above column, row, and full checksums can be used to detect (or correct) 
errors in various matrix operations. For example, we can replace the matrix addition 
$A + B = C$ by $A_C + B_C = C_C$ or $A_R + B_R = C_R$ or $A_F + B_F = C_F$. Similarly, instead 
of calculating $AB = C$, we may compute $AB_R = C_R$ or $A_CB = C_C$ or $A_CB_R = C_F$. 
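
The claim that multiplying a column checksum matrix by a row checksum matrix yields the full checksum matrix of the product can be checked numerically. The sketch below uses plain Python lists and small integer matrices chosen only for illustration; the helper names are hypothetical.

```python
def matmul(x, y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def column_checksum(a):
    """Append the row eA of column sums."""
    return [list(row) for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(a):
    """Append the column Af of row sums."""
    return [list(row) + [sum(row)] for row in a]


def full_checksum(a):
    """Append both the checksum column and the checksum row."""
    return column_checksum(row_checksum(a))


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)

# Multiplying the encoded operands produces the encoded result directly.
encoded_product = matmul(column_checksum(A), row_checksum(B))
checksums_preserved = encoded_product == full_checksum(C)
```

Because the checksums are carried through the multiplication itself, a fault during the computation shows up as a mismatch between the data part of the result and its checksum row or column.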

To allow locating and correcting errors even if only a column or row checksum 
matrix is used (rather than the full checksum matrix), a second checksum value 
is added to each column or row, respectively. The resulting matrices are called 
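column, row, and full-weighted matrices and are shown below: 

$$A_C = \begin{bmatrix} A \\ eA \\ e_wA \end{bmatrix} \qquad A_R = [A \quad Af \quad Af_w] \qquad A_F = \begin{bmatrix} A & Af & Af_w \\ eA & eAf & eAf_w \\ e_wA & e_wAf & e_wAf_w \end{bmatrix}$$

where $e_w = [1\ 2\ \cdots\ 2^{n-1}]$ and $f_w = [1\ 2\ \cdots\ 2^{m-1}]^T$. 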

This Weighted-Checksum Code (WCC) can correct a single error even if only 
two rows or two columns are added to the original matrix. For example, suppose 
that $A_C$ is used and an error in column $j$ is detected. Denote by WCS1 
and WCS2 the values of the unweighted checksum $eA$ and the weighted checksum 
$e_wA$ in column $j$, respectively. We then calculate the error in the unweighted 
checksum, $S_1 = \sum_{i=1}^{n} a_{i,j} - \mathrm{WCS1}$, and the error in the weighted checksum, 
$S_2 = \sum_{i=1}^{n} 2^{i-1} a_{i,j} - \mathrm{WCS2}$. If only one of these two error syndromes $S_1$ and $S_2$ is 
nonzero, then the corresponding checksum value is erroneous. If both $S_1$ and $S_2$ 
are nonzero, $S_2/S_1 = 2^{k-1}$ implies that the element $a_{k,j}$ is erroneous and can be 
corrected through $a'_{k,j} = a_{k,j} - S_1$. 
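
The syndrome-based correction just described can be sketched as follows. The code assumes exact integer arithmetic and at most one error per column; the function names and the string return values are hypothetical choices made for this illustration, and rows are indexed from zero, so the text's row $k$ corresponds to index $k - 1$.

```python
def column_weighted(a):
    """Append the unweighted (eA) and weighted (e_w A) checksum rows."""
    columns = list(zip(*a))
    unweighted = [sum(col) for col in columns]
    weighted = [sum((2 ** i) * x for i, x in enumerate(col)) for col in columns]
    return [list(row) for row in a] + [unweighted, weighted]


def correct_column(ac, j):
    """Diagnose column j of a column-weighted matrix; repair one error in place.

    Returns "ok", "unweighted checksum", "weighted checksum", or the
    zero-based row index of the corrected data element.
    """
    n = len(ac) - 2
    data = [ac[i][j] for i in range(n)]
    s1 = sum(data) - ac[n][j]                                         # S1
    s2 = sum((2 ** i) * x for i, x in enumerate(data)) - ac[n + 1][j]  # S2
    if s1 == 0 and s2 == 0:
        return "ok"
    if s2 == 0:                      # only S1 is nonzero
        ac[n][j] += s1
        return "unweighted checksum"
    if s1 == 0:                      # only S2 is nonzero
        ac[n + 1][j] += s2
        return "weighted checksum"
    ratio, remainder = divmod(s2, s1)
    is_power_of_two = ratio > 0 and ratio & (ratio - 1) == 0
    if remainder != 0 or not is_power_of_two or ratio.bit_length() > n:
        raise ValueError("syndromes do not match a single error")
    k = ratio.bit_length() - 1       # S2/S1 = 2**k for zero-based row k
    ac[k][j] -= s1
    return k


ac = column_weighted([[3, 1], [0, 2], [5, 4]])
ac[1][0] = 6                         # corrupt a data element in column 0
fixed_row = correct_column(ac, 0)    # S1 = 6, S2 = 12, ratio 2 -> row index 1
```

The ratio of the two syndromes identifies the row, and $S_1$ alone gives the magnitude of the error, so two extra rows suffice for single-error correction.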

The weighted checksum encoding scheme can be further extended to increase 
its error detection and correction capabilities by adding extra rows and/or 
columns with weights of the form $e_{w_d} = [1^{d-1}\ 2^{d-1}\ \cdots\ (2^{n-1})^{d-1}]$ and 
$f_{w_d} = [1^{d-1}\ 2^{d-1}\ \cdots\ (2^{m-1})^{d-1}]^T$. Note that for $d = 1$ and $d = 2$, we obtain the above 
two (unweighted and weighted) checksums. If all the weights for $d = 1, 2, \ldots, v$ are 
used, the resulting weighted checksum encoding scheme has a Hamming distance 
of $v + 1$ and, as a result, is capable of detecting up to $v$ errors and correcting up to 
$\lfloor v/2 \rfloor$. We will focus below only on the case of $v = 2$. 

For large values of $n$ and $m$, the unweighted and weighted checksums can become 
very large and cause overflows. For the unweighted checksum, we can use 
the single-precision checksum scheme using two's complement arithmetic and 
discarding overflows. Discarding overflows implies that the sum will be calculated 
modulo $2^l$, where $l$ is the number of bits in a word. If only a single element 
of the matrix $A$ is erroneous, the error cannot exceed $2^l - 1$, and the modulo-$2^l$ 
calculation performed for the single-precision checksum will provide the correct 
value of the syndrome $S_1$. 
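
A small numeric sketch may help show why discarding overflows does not harm single-error correction. It assumes 8-bit words held as unsigned values 0 to 255 (the word length and the sample data are arbitrary choices for illustration) and performs every operation modulo $2^l$.

```python
WORD_BITS = 8                 # l, an assumed word length for this example
MODULUS = 1 << WORD_BITS      # 2**l


def checksum(column):
    """Single-precision checksum: the sum with overflows discarded."""
    return sum(column) % MODULUS


def syndrome(column, stored_checksum):
    """S1 computed modulo 2**l."""
    return (sum(column) - stored_checksum) % MODULUS


original = [200, 150, 90]
stored = checksum(original)           # 440 overflows 8 bits; 184 is kept

corrupted = [200, 150, 17]            # third element is wrong
s1 = syndrome(corrupted, stored)
repaired = (corrupted[2] - s1) % MODULUS
```

Even though the true sum does not fit in a word, the modular syndrome still equals the error modulo $2^l$, so subtracting it (again modulo $2^l$) restores the original element.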

The weighted checksum uses the weights $[1\ 2\ \cdots\ 2^{m-1}]$ and would need more 
than $l$ bits. We can reduce the largest value that the weighted checksum can assume 
by using a weight vector $e_w$ with smaller weights. For example, instead of 
$[1\ 2\ \cdots\ 2^{n-1}]$, we can use $[1\ 2\ \cdots\ n]$. For these weights, if both error syndromes 
$S_1$ and $S_2$ for column $j$ are nonzero, $S_2/S_1 = k$ implies that the element $a_{k,j}$ is erroneous 
and it can be corrected as before through $a'_{k,j} = a_{k,j} - S_1$. 

If floating-point arithmetic is used for the matrix operations, an additional complexity 
arises. Floating-point calculations may have roundoff errors that can result 
in a nonzero error syndrome $S_1$ even if all the matrix elements were computed 
correctly. Thus, we must set an error bound $\delta$ such that $|S_1| < \delta$ will not signal a 
fault. The proper value of $\delta$ depends on the type of data, the type of calculations 
performed, and the size of the matrix. Setting $\delta$ too low will lead to roundoff errors 
being misinterpreted as faults (causing false alarms), whereas setting it too high can 
reduce the probability of fault detection. One way to deal with this problem is 
to partition the matrix into submatrices and assign checksums to each submatrix 
separately. The smaller size of these submatrices will greatly simplify the selection 
of a value for $\delta$ that provides a good tradeoff between the probability of 
a false alarm and the probability of fault detection. Partitioning into submatrices 
will slightly increase the complexity of the calculations but will allow the detection 
of multiple faults even if only two (unweighted and weighted) checksums are 
used. 
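
The role of the error bound can be illustrated with a short sketch. The threshold `DELTA = 1e-9`, the sample column, and the injected error of 0.5 are all assumptions made for this example; a real application would have to derive the bound from its own data and matrix sizes.

```python
DELTA = 1e-9   # assumed error bound, chosen only for this illustration


def column_syndrome(column, stored_checksum):
    """S1: recomputed column sum minus the stored checksum."""
    return sum(column) - stored_checksum


def signals_fault(syndrome, delta):
    """Report a fault only when the syndrome reaches the error bound."""
    return abs(syndrome) >= delta


column = [0.1, 0.2, 0.3]
stored = sum(column)

# Scale the data and the checksum separately, as a checksum-encoded
# scalar multiplication would. Roundoff may make the two disagree slightly.
scaled = [3.0 * x for x in column]
scaled_checksum = 3.0 * stored
clean_syndrome = column_syndrome(scaled, scaled_checksum)

# A genuine fault changes one element by far more than roundoff could.
faulty = [scaled[0], scaled[1] + 0.5, scaled[2]]
faulty_syndrome = column_syndrome(faulty, scaled_checksum)

clean_flagged = signals_fault(clean_syndrome, DELTA)
faulty_flagged = signals_fault(faulty_syndrome, DELTA)
```

Comparing the syndrome against exactly zero would risk flagging the fault-free computation, whereas a bound that is too generous would let small genuine errors slip through.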

3.5 Further Reading 

Many textbooks on the topic of coding theory are available. See, for example, 
[7-11,20,22,25,33,35,37,38,43-45,52,56-58]. Cyclic codes are discussed in detail in 
[7,9,11,22,33,35,37,38,43-45,49,52,56,57]. There are several websites that include 
descriptions of various codes and even software implementations of some of them 
[13,15,31,39,46,59]. Arithmetic codes are discussed in [3,4,30,48] and unidirectional 
codes are covered in [12,49]. 

Descriptions of RAID structures are widely available in textbooks on computer 
architecture. See also [14,23,42]. 

An excellent source for voting algorithms is [28]. Pioneering work in this area 
appears in [18] and [55]. Further key contributions are presented in [6,17]. 
Hierarchical voting is described in [32]. See also [47] for a discussion of the tradeoff 
between message overheads and data availability and [26,27] for dynamic vote 
assignment, as well as [1,36] on quorums when servers may suffer Byzantine faults. 
The tradeoff between the load of a quorum system and its availability has been 
studied in [41]. The primary/backup approach to data-replica management can be 
found in [16,19,28,40,54]. The references also discuss another approach to replica 
management, where no single node is designated as the primary, but each copy 
can manage client requests. This is called active replication or the state-machine 
approach. 

Algorithm-based fault tolerance was first proposed in [24] and further developed 
in [5,29]. Alternative weights for the checksum codes are presented in [2,34] 
and extending the approach to floating-point calculations is discussed in [50,60]. 
Round-off errors in floating-point operations are described in [30]. 


3.6 Exercises 

1. Prove that it is possible to find at most 28 8-bit binary words such that the 
Hamming distance between any two of them is at least 3. 

2. To an $n$-bit word with a single parity bit (for a total of $(n + 1)$ bits), a second 
parity bit for the $(n + 1)$-bit word has been added. How would the error detection 
capabilities change? 

3. Show that the Hamming distance of an M-of-N code is 2. 

4. Compare two parity codes for data words consisting of 64 data bits: (1) a (72,8) 
Hamming code and (2) a single-parity bit per byte. Both codes require 8 check 
bits. Indicate the error correction and detection capabilities, the expected overhead, 
and list the types of multiple errors that are detectable by these two 
codes. 

5. Show that a code can detect all unidirectional errors if and only if no two of its 
codewords are ordered. Two binary $N$-bit words $X$ and $Y$ are ordered if either 
$x_i \le y_i$ for all $i \in \{1, 2, \ldots, N\}$ or $x_i \ge y_i$ for all $i \in \{1, 2, \ldots, N\}$. 

6. A communication channel has a probability of $10^{-3}$ that a bit transmitted on it 
is erroneous. The data rate is 12,000 bits per second (bps). Data packets contain 
240 information bits, a 32-bit CRC for error detection, and 0, 8, or 16 bits for 
error correction coding (ECC). Assume that if 8 ECC bits are added, all single-bit 
errors can be corrected, and if 16 ECC bits are added all double-bit errors 
can be corrected. 

   a. Find the throughput in information bits per second of a scheme consisting 
   of error detection with retransmission of bad packets (i.e., no error correction). 

   b. Find the throughput if eight ECC check bits are used, so that single-bit 
   errors can be corrected. Uncorrectable packets must be retransmitted. 

   c. Finally, find the throughput if 16 ECC check bits are appended, so that 2-bit 
   errors can be corrected. As in (b), uncorrectable packets must be retransmitted. 
   Would you recommend increasing the number of ECC check bits 
   from 8 to 16? 

7. Derive all codewords for the separable 5-bit cyclic code based on the generating 
polynomial $X + 1$ and compare the resulting codewords to those for the 
nonseparable code. 

8. a. Show that if the generating polynomial $G(X)$ of a cyclic code has more than 
   one term, all single-bit errors will be detected. 

   b. Show that if $G(X)$ has a factor with three terms, all double-bit errors will 
   be detected. 

   c. Show that if $G(X)$ has $X + 1$ as a factor, all odd numbers of bit errors will 
   be detected. That is, if $E(X)$ contains an odd number of terms (errors), it 
   does not have $X + 1$ as a factor. Also show that CRC-16 and CRC-CCITT 
   contain $X + 1$ as a factor. What are the error detection capabilities of these 
   cyclic codes? 

9. Given that $X^7 - 1 = (X + 1)\,g_1(X)\,g_2(X)$, where $g_1(X) = X^3 + X + 1$, 

   a. Calculate $g_2(X)$. 

   b. Identify all the $(7, k)$ cyclic codes that can be generated based on the factors 
   of $X^7 - 1$. How many different such cyclic codes exist? 

   c. Show all the codewords generated by $g_1(X)$ and their corresponding data 
   words. 

10. Given a number $X$ and its residue modulo-3, $C(X) = |X|_3$, how will the residue 
change when $X$ is shifted by one bit position to the left if the shifted-out bit is 
0? Repeat this for the case where the shifted-out bit is 1. Verify your rule for 
$X = 01101$ shifted five times to the left. 

11. Show that a residue check with the modulus $A = 2^a - 1$ can detect all errors 
in a group of $a - 1$ (or fewer) adjacent bits. Such errors are called burst errors 
of length $a - 1$ (or less). 

12. You have a RAID1 system in which failures occur at individual disks at a 
constant rate $\lambda$ per disk. The repair time of disks is exponentially distributed 
with rate $\mu$. Suppose we are in an earthquake-prone area, where building-destroying 
earthquakes occur according to a Poisson process with rate $\lambda_e$. If 
the building is destroyed, so too is the entire RAID system. Derive an expression 
for the probability of data loss for such a system as a function of 
time. Assuming that the mean time between such earthquakes is 50 years, 
plot the probability of data loss as a function of time using the parameters 
$1/\lambda = 500{,}000$ hours and $1/\mu = 1$ hour. 

13. For a RAID level 3 system with d data disks and one parity disk, as d increases 
the overhead decreases but the unreliability increases. Suggest a measure for 
cost-effectiveness and find the value of d which will maximize your proposed 
measure. 

14. Given a RAID level 5 system with an orthogonal arrangement of $d + 1$ strings 
and $g = 8$ RAID groups, compare the MTTDL for different values of $d$ from 
4 to 10. Assume an exponential repair time for single disks and for strings 
of disks with rates of 1/hour and 3/hour, respectively. Also assume failure 
rates for single disks and strings of disks of $10^{-6}$/hour and $5 \cdot 10^{-6}$/hour, 
respectively. 

15. Derive expressions for the reliability and availability of the network shown in 
Figure 3.24a for the case $(r, w) = (3, 3)$ where a single vote is assigned to each 
node in the nonhierarchical organization. In this case, both read and write operations 
can take place if at least three of the five nodes are up. Assume that 
failures occur at each node according to a Poisson process with rate $\lambda$, but 
the links do not fail. When a node fails, it is repaired (repair includes loading 
up-to-date data) and the repair time is an exponentially distributed random 
variable with mean $1/\mu$. Derive the required expressions for the system reliability 
and availability using the Markov chains (see Chapter 2) shown in 
Figure 3.28a and b, respectively, where the state is the number of nodes that 
are down. 

16. In Figure 3.28, a Markov chain is provided for the case in which nodes can be 
repaired in an exponentially distributed time. Suppose instead that the repair 
time was a fixed, deterministic time. How would this complicate the model? 

17. For the model shown in Question 15, suppose $\lambda = 10^{-3}$ and $\mu = 1$. Calculate 
the reliability and availability of each of the following configurations: $(r, w) =$ 
(3,3), (2,4), (1,5). 


FIGURE 3.28 Markov chains for Questions 15-16 ($(r, w) = (3, 3)$): (a) reliability, with failure transitions $5\lambda$, $4\lambda$, $3\lambda$; (b) availability, with failure transitions $5\lambda$, $4\lambda$, $3\lambda$, $2\lambda$, $\lambda$. 

FIGURE 3.29 An example network (numbers indicate availabilities). 


18. For the example shown in Figure 3.29, the four nodes have an availability of 1, 
while the links have the availabilities indicated in the figure. Use Heuristic 2 
to assign votes to the four nodes, write down the possible values for $w$ and 
the corresponding minimal values of $r$, and calculate the availability for each 
possible value of $(r, w)$. Assume that read operations are twice as frequent as 
write operations. 

19. Prove that in the hierarchical quorum generation approach in Section 3.3.2, 
every possible read quorum intersects with every possible write quorum in at 
least one node. 

20. Consider the tree shown in Figure 3.27. If $p$ is the probability that a leaf node 
is faulty, obtain an expression for the probability that read and write quorums 
exist. Assume that $r_1 = r_2 = w_1 = w_2 = 2$ and that nodes at levels 0 and 1 do 
not fail. 

21. Show how checksums can be used to detect and correct errors in a scalar by 
matrix multiplication for the following example. Assume a $3 \times 3$ matrix: 

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$$

Show the corresponding column-weighted matrix $A_C$ and assume that during 
the multiplication of $A_C$ by the scalar 2 a single error has occurred resulting 
in the following output: 

$$2 \cdot A = \begin{bmatrix} 2 & 4 & 6 \\ 8 & 10 & 12 \\ 14 & 17 & 18 \end{bmatrix}$$


CHAPTER 4 Fault-Tolerant Networks 


Interconnection networks are widely used today. The simplest example is a 
network connecting processors and memory modules in a shared-memory multiprocessor, 
in which processors perform read or write operations in the memory 
modules. Another example is a network connecting a number of processors (typically 
with their own local memory) in a distributed system, allowing the processors 
to communicate through messages while executing parts of a common application. 
In these two types of network, the individual components (processors and 
memories) are connected through a collection of links and switchboxes, where a 
switchbox allows a given component to communicate with several other components 
without having a separate link to each of them. 

A third type of network, called wide-area networks, connects large numbers of 
processors that operate independently (and typically execute different and unrelated 
applications), allowing them to share various types of information. In such 
networks, the term packet is often used instead of message (a message may consist 
of several packets, each traversing the network independently), and they consist 
of more complicated switchboxes called routers. The best known example of this 
kind of network is the Internet. 

The network's links and switchboxes establish one or more paths between the 
sender of the message (the source) and its receiver (the destination). These links and 
switchboxes can be either unidirectional or bidirectional. The specific organization, 
or topology, of the network may provide only a single path between a given 
source and a given destination, in which case any fault of a link or switchbox along 
the path will disconnect the source-destination pair. Fault tolerance in networks is 
thus achieved by having multiple paths connecting source to destination, and/or 
spare units that can be switched in to replace the failed units. 

Many existing network topologies contain multiple paths for some or all 
source-destination pairs, and there is a need to evaluate the resilience to faults 
provided by such redundancy, as well as the degradation in the network operation 
as faults accumulate. 

We begin this chapter by presenting several measures of resilience/fault-tolerance 
in networks. Then, we turn to several well-known network topologies 
used in distributed or parallel computing, analyze their resilience in the presence 
of failures, and describe ways of increasing their fault tolerance. We restrict 
ourselves in this chapter to networks meant for use in parallel and distributed 
computer systems. This field of network fault tolerance is large, and we will only 
be providing a brief sampling in this chapter. Pointers for further reading can be 
found toward the end. 

There is a vast literature on adaptive routing and recovery from lost packets in 
the field of wide-area networks: for that material, the reader should consult one of 
the many available books on computer networking. 


4.1 Measures of Resilience 

To quantify the resilience of a network or its degradation in the presence of node 
and link failures, we need measures, several of which are presented in this section. 
We start with generic, graph-theoretical measures and then list several measures 
specific to fault tolerance. 

4.1.1 Graph-Theoretical Measures 

Representing the network as a graph, with processors and switchboxes as nodes 
and links as edges, we can apply resilience measures used in graph theory. Two 
such measures are: 

■ Node and Link Connectivity. Perhaps the simplest consideration with respect 
to any network in the presence of faults is whether the network as a 
whole is still connected in spite of the failures, or whether some nodes are 
cut off and cannot communicate with the rest. Accordingly, the node (link) 
connectivity of a graph is defined as the minimum number of nodes (links) 
that must be removed from the graph in order to disconnect it. (When a 
node is removed, all links incident on it are removed as well.) Clearly, the 
higher the connectivity, the more resilient the network is to faults. 

■ Diameter Stability. The distance between a source and a destination node in 
a network is defined as the smallest number of links that must be traversed 
in order to forward a message from the source to the destination. The diameter 
of a network is the longest distance between any two nodes. Even if 
the network has multiple paths for every source-destination pair, we must 
expect the distance between nodes to increase as links or nodes fail. Diameter 
stability focuses on how rapidly the diameter increases as nodes fail in 
the network (recall that the term nodes refers not only to processors but to 
switchboxes as well). A deterministic instance of such a measure is the persistence, 
which is the smallest number of nodes that must fail in order for the 
diameter to increase. For example, the persistence of a cycle graph is 1: the 
failure of just one node causes a cycle of $n$ nodes to become a path of $n - 1$ 
nodes, and the diameter jumps from $\lfloor n/2 \rfloor$ to $n - 2$. A probabilistic measure 
of diameter stability is the vector 

$$DS = (p_{d+1}, p_{d+2}, \ldots)$$

where $p_{d+i}$ is the probability that the diameter of the network increases from 
$d$ to $d + i$ as a result of faults that occur according to some given probability 
distribution. In these terms, $p_\infty$ is the probability of the diameter becoming 
infinite, namely, the graph being disconnected. 
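
The cycle-graph example can be checked with a short breadth-first-search sketch. The graph representation (a dictionary of neighbor sets) and the function names are assumptions made for this illustration, and the cycle size of 8 is arbitrary.

```python
from collections import deque


def cycle_graph(n):
    """A cycle of n nodes (n >= 3) as a dict mapping node -> set of neighbors."""
    return {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}


def remove_node(graph, failed):
    """Remove a failed node together with all links incident on it."""
    return {v: nbrs - {failed} for v, nbrs in graph.items() if v != failed}


def diameter(graph):
    """Longest shortest-path distance; infinity if the graph is disconnected."""
    worst = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < len(graph):
            return float("inf")
        worst = max(worst, max(dist.values()))
    return worst


ring = cycle_graph(8)
before = diameter(ring)                          # floor(8/2)
after_one = diameter(remove_node(ring, 3))       # path of 7 nodes: 8 - 2
after_two = diameter(remove_node(remove_node(ring, 0), 4))
```

A single node failure already raises the diameter, which is what a persistence of 1 means; a second, non-adjacent failure disconnects the ring, reflecting its node connectivity of 2.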

4.1.2 Computer Networks Measures 

The following measures express the degradation of the dependability and performance 
of a computer network in the presence of faults better than the rather 
generic measures listed above. 

■ Reliability. We define $R(t)$, the network reliability at time $t$, as the probability 
that all the nodes are operational and can communicate with each other over 
the entire time interval $[0, t]$. If no redundancy exists in the network, $R(t)$ is 
the probability of no faults occurring up to time $t$. If the network has spare 
resources in the form of redundant nodes and/or multiple paths between 
source-destination pairs, the fact that the network is operational at time $t$ 
means that any failed processing node has been successfully replaced by a 
spare, and even if some links failed, every source-destination pair can still 
communicate over at least one fault-free path. 

If a specific source-destination pair is of special interest, we define the 
path reliability (sometimes called terminal reliability) as the probability that 
an operational path has existed for this source-destination pair during the 
entire interval $[0, t]$. 

An important point to emphasize here is that the reliability (and for that 
matter, also the graph-theoretical measures listed above) does not include 
the option of repairing the network (other than switching in a spare), although 
the management of most actual networks involves the repair or replacement 
of any faulty component. The reason for that omission is that the 
reliability measure is intended to give an assessment of the resilience of the 
network, possibly compared to other similar networks. Also, repair is not 
always possible or immediate and may be very expensive. If 
repair is an integral component of the system's management, availability (as 
defined in Chapter 2) can be used instead of reliability. 





■ Bandwidth. The meaning of bandwidth depends on its context. For a communications 
engineer, the bandwidth of a channel often stands for the range 
of frequencies that it can carry. The term can also mean the maximum rate at 
which messages can flow in a network. For example, a particular link could 
be specified as being able to carry up to 10 Mbits per second. One can also 
use the term in a probabilistic sense: for a certain pattern of accesses to a file 
system, we can use the bandwidth to mean the average number of bytes per 
second that can be accessed by this system. 

The maximum rate at which messages can flow in a network (the theoretical 
upper bound of the bandwidth) usually degrades as nodes or links 
fail in a network. In assessing a network, we are often interested in how this 
expected maximum rate depends on the failure and repair rates. 

■ Connectability. The node and link connectivity as defined above are rather 
simplistic measures of network vulnerability and say nothing about how 
the network degenerates before it becomes completely disconnected. A 
more informative measure is connectability: the connectability at time $t$, 
denoted by $Q(t)$, is defined as the expected number at time $t$ of source-destination 
pairs which are still connected in the presence of a failure 
process. This measure is especially applicable to a shared-memory multiprocessor, 
where $Q(t)$ denotes the expected number of processor-memory 
pairs that are still communicating at time $t$. 


4.2 Common Network Topologies and Their Resilience 

We present in this section examples of two types of network. The first type connects 
a set of input nodes (e.g., processors) to a set of output nodes (e.g., memories) 
through a network composed only of switchboxes and links. As examples 
for this type, we use the multistage and crossbar networks with bandwidth and 
connectability as measures for their resilience. The second type is a network of 
computing nodes that are interconnected through links. No separate switchboxes 
exist in these networks; instead, the nodes serve as switches as well as processors 
and are capable of forwarding messages that pass through them on the way 
to their final destination. The networks we use as examples for this type are the 
mesh and the hypercube, and the applicable measures for these networks are the 
reliability/path reliability or the availability, if repair is considered. 


4.2.1 Multistage and Extra-Stage Networks 

Multistage networks are commonly used to connect a set of input nodes to a set of 
output nodes through either unidirectional or bidirectional links. These networks 
are typically built out of 2 x 2 switchboxes. These are switches that have two inputs 
and two outputs each, and can be in any of the following four settings (see 
Figure 4.1): 

■ Straight. The top input line is connected to the top output, and the bottom 
input line to the bottom output. 

■ Cross. The top input line is connected to the bottom output, and the bottom 
input line to the top output. 

■ Upper Broadcast. The top input line is connected to both output lines. 

■ Lower Broadcast. The bottom input line is connected to both output lines. 

FIGURE 4.1 2 x 2 switchbox settings: (a) straight, (b) cross, (c) upper broadcast, (d) lower broadcast. 

FIGURE 4.2 An 8 x 8 butterfly network. 

A well-known multistage network is the butterfly. As an example see the three-stage 
butterfly connecting eight inputs to eight outputs shown in Figure 4.2. We 
have numbered each line in every switchbox such that a switchbox in stage $i$ has 
lines numbered $2^i$ apart. Output line $j$ of every stage goes into input line $j$ of the 
following stage, for $j = 0, \ldots, 7$. Such a numbering scheme is probably the easiest 
way to remember the butterfly structure. 

A $2^k \times 2^k$ butterfly network connects $2^k$ inputs to $2^k$ outputs and is made up of 
$k$ stages of $2^{k-1}$ switchboxes each. The connections follow a recursive pattern from 
the input end to the output. For example, the 8 x 8 butterfly network shown in 
Figure 4.2 is constructed out of two 4 x 4 butterfly networks plus an input stage 
consisting of four switchboxes. In general, the input stage of a $k$-stage butterfly 
($k \ge 3$) has the top output line of each switchbox connected to an input line of one 
$2^{k-1} \times 2^{k-1}$ butterfly, and the bottom output line of each switchbox connected to an 
input line of another $2^{k-1} \times 2^{k-1}$ butterfly. The input stage of a two-stage butterfly 
(see the 4 x 4 butterfly in Figure 4.2) has the top output line of each of its two 
switchboxes connected to one 2 x 2 switchbox, and the bottom output line to the 
second 2 x 2 switchbox. 

An examination of the butterfly quickly reveals that the butterfly is not fault 
tolerant: there is only one path from any given input to any specific output. In 
particular, if a switchbox in stage $i$ were to fail, there would be $2^{k-i}$ inputs which 
could no longer connect to any of $2^{i+1}$ outputs. The node and link connectivities 
are therefore each equal to 1. For example, if the switchbox in stage 1 that is labeled 
a in Figure 4.2 fails, the $2^{3-1} = 4$ inputs 0, 2, 4, and 6 will become disconnected from 
the $2^{1+1} = 4$ outputs 4, 5, 6, and 7. 
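
The effect of a single switchbox failure can be reproduced with a small routing sketch. It rests on one reading of the numbering scheme given above, which should be treated as an assumption: a stage-$i$ switchbox carries two lines whose numbers differ only in bit $i$, stages are traversed from stage $k - 1$ (inputs) down to stage 0 (outputs), and at stage $i$ a message moves to the line whose bit $i$ matches the destination. Under that reading, the switchbox that cuts inputs 0, 2, 4, 6 off from outputs 4 to 7 is the stage-1 box carrying lines 4 and 6. The function names are hypothetical.

```python
def switchbox(stage, line):
    """Identify the switchbox at `stage` that carries `line`.

    A box is named (stage, low_line, high_line); its two lines differ in bit `stage`.
    """
    bit = 1 << stage
    low = line & ~bit
    return (stage, low, low | bit)


def boxes_on_path(src, dst, k):
    """Switchboxes visited by the unique path from input src to output dst."""
    line = src
    visited = []
    for stage in range(k - 1, -1, -1):
        visited.append(switchbox(stage, line))
        bit = 1 << stage
        line = (line & ~bit) | (dst & bit)   # steer by destination bit
    return visited


def blocked_pairs(failed_box, k):
    """All (input, output) pairs whose only path uses the failed switchbox."""
    n = 1 << k
    return {(s, d) for s in range(n) for d in range(n)
            if failed_box in boxes_on_path(s, d, k)}


blocked = blocked_pairs((1, 4, 6), k=3)
cut_inputs = sorted({s for s, _ in blocked})
cut_outputs = sorted({d for _, d in blocked})
```

Since every input-output pair has exactly one path, each failed switchbox removes a fixed block of pairs, and the sizes of the two sets match the $2^{k-i}$ and $2^{i+1}$ counts stated above.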

One way to render the network fault tolerant is to introduce an extra stage, by 
duplicating stage 0 at the input. In addition, bypass multiplexers are provided to 
route around switchboxes in the input and output stages. If a switchbox in these 
stages is faulty, such a multiplexer can be used to route around the failure. An 
8 x 8 extra-stage butterfly is shown in Figure 4.3. This network can remain connected 
despite the failure of up to one switchbox anywhere in the system. Suppose, 
for example, that the stage-0 switchbox carrying lines 2,3 fails. Then, whatever 
switching it would have done can be duplicated by the extra stage, while 
the failed box is bypassed by the multiplexer. Or, suppose that the switchbox in 
stage 2 carrying lines 0,4 fails. Then, the extra stage can be set so that input line 
0 is switched to output line 1, and input line 4 to output line 5, thus bypassing 
the failed switchbox. Proving formally that this network can tolerate up to one 
switchbox failure is quite easy and is left as an exercise for the reader. This proof is 
based on the fact that because the line numbers in any stage-$i$ box are $2^i$ apart, the 
numbers in any box other than at the output and extra stages are both of the same 
(even or odd) parity. 

The network we have depicted connects a set of input nodes to a set of output 
nodes. The input and output nodes may be the same nodes, in which case node i 
provides data at line i of the input side and obtains data from line i of the output 
side. When the sets are disjoint (e.g., a set of processors is connected to a set of 
memory modules), we can have two networks, one in each direction. Figure 4.4 
illustrates these configurations. 


FIGURE 4.3 An 8 x 8 extra-stage butterfly network. 

FIGURE 4.4 Two possible configurations for multistage networks. 

Analysis of the Butterfly Network 

In what follows we analyze the resilience of a $k$-stage butterfly interconnection 
network that connects $N = 2^k$ processors to $N = 2^k$ memory units in a shared-memory 
architecture. 

Let us start by deriving the bandwidth of this network in the absence of failures. 
The bandwidth in this context is defined as the expected number of access 
requests from the processors that reach the memory modules. We will assume that 
every processor generates in each cycle, with probability $p_r$, a request to a memory 
module. This request is directed to any of the $N$ memory modules with equal 
probability, $1/N$. Hence, the probability that a given processor generates a request 
to a specific memory module $i$ ($i \in \{0, 1, \ldots, N - 1\}$) is $p_r/N$. For simplicity, assume 
that each processor makes a request that is independent of its previous requests. 
Even if its previous request was not satisfied, the processor will generate a new, 
independent request. This is obviously an approximation: in practice, a processor 
will repeat its request until it is satisfied. 

Because of the symmetry of the butterfly network and our assumption that all $N$ 
processors generate requests to all $N$ memories in a uniform fashion, all $N$ output 
lines of a stage, say stage $i$, will carry a memory request with the same probability. 
Let us denote this probability by $p_r^{(i)}$, $i = 0, 1, \ldots, k - 1$. We calculate this probability 
stage by stage, starting at the inputs (processors) where $i = k - 1$ and working our 
way to the outputs (memories) where $i = 0$. 

Starting from $i = k - 1$, the memory requests of each processor (at a probability 
of $p_r$) will, on the average, be equally divided between the two output lines of 
the switchbox to which the processor is connected. That is, the probability that a 
certain output line of a switchbox at stage $(k - 1)$ will carry a request generated 
by one of the two processors is $p_r/2$. Because a request on that output line can be 
generated by either of the two processors, $p_r^{(k-1)}$ is the probability of the union of 
the two corresponding events (each with probability $p_r/2$). Using the basic laws 
of probability, we can write 

$$p_r^{(k-1)} = \frac{p_r}{2} + \frac{p_r}{2} - \left(\frac{p_r}{2}\right)^2 = p_r - \frac{p_r^2}{4}$$

Using a similar argument to derive an expression for $p_r^{(i-1)}$ when given $p_r^{(i)}$ yields 
the following recursive equation: 

$$p_r^{(i-1)} = p_r^{(i)} - \frac{\left(p_r^{(i)}\right)^2}{4}$$


Here, too, we rely of the statistical independence of the requests carried by the two 
input lines to a switchbox, since the two routes they traverse are disjoint. 

The bandwidth of the network is the expected number of requests that make it 
to the memory end, which is 

$$ BW = N p_r^{(0)} \tag{4.1} $$

This approach can be extended to nonsymmetric access patterns, in which different
memory modules are requested with differing probabilities.

We can now extend this analysis to include the possibility of faulty lines. Assume
that a faulty line acts as an open circuit. For any link, let $q_l$ be the probability
that it is faulty and $p_l = 1 - q_l$ the probability that it is fault-free. Note that we
have omitted the dependence on time to simplify the notation.

We assume that the failure probability of a switchbox is incorporated into that
of its incident links, and thus, in what follows we assume that only links can fail.
The probability that a request at the input line to a switchbox at stage $(i-1)$ will
propagate to one of the corresponding outputs in stage i is $p_l p_r^{(i)}/2$. The resulting
recursive equation is therefore

$$ p_r^{(i-1)} = p_l p_r^{(i)} - \frac{\left(p_l p_r^{(i)}\right)^2}{4} $$

Setting $p_r^{(k)} = p_r$, we now calculate $p_r^{(0)}$ recursively, and substitute it in Equation 4.1.
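
The recursion lends itself to a few lines of code. The following is a minimal Python sketch, not taken from the text: the function names are illustrative, and it assumes the recursion is applied k times starting from $p_r^{(k)} = p_r$, so that a link reliability of 1 reproduces the fault-free case.

```python
def request_probability(k: int, p_r: float, p_l: float = 1.0) -> float:
    """Return p_r^(0), the probability that an output line of stage 0
    carries a request, for a k-stage butterfly.

    p_r : probability that a processor issues a request in a cycle
    p_l : probability that a link is fault-free (1.0 = no failures)
    """
    p = p_r  # p_r^(k) = p_r at the processor side
    for _ in range(k):
        x = p_l * p          # request survives the link with probability p_l
        p = x - x * x / 4.0  # union of two independent events of probability x/2
    return p


def butterfly_bandwidth(k: int, p_r: float, p_l: float = 1.0) -> float:
    """Equation 4.1: BW = N * p_r^(0) with N = 2**k."""
    return (2 ** k) * request_probability(k, p_r, p_l)


if __name__ == "__main__":
    for p_l in (1.0, 0.99, 0.9):
        print(p_l, butterfly_bandwidth(3, 1.0, p_l))
```

For a single stage with every processor requesting in every cycle, the sketch gives 1.5, which is the expected number of distinct memories chosen by two processors picking uniformly between two memories.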

Let us now turn to calculating the expected number of connected processor-memory
pairs in a k-stage, $2^k \times 2^k$ network, which we call network connectability.
We are focusing here on the properties of the network and not on the health of
the processors and memories. There are k + 1 links and k switchboxes that need to
be traversed in a k-stage network. We make here a distinction between switchbox
failures and link failures and denote by $q_s$ the probability that a switchbox fails
($p_s = 1 - q_s$). Because links and switchboxes are assumed to fail independently,
and all k + 1 links and all k switchboxes on the input-output path must be up for
a given processor-memory pair to be connected, the probability that this happens
is $(1 - q_l)^{k+1} (1 - q_s)^k = p_l^{k+1} p_s^k$. Since there are $2^{2k}$ input-output pairs, the expected
number of pairs that are connected is given by

$$ Q = 2^{2k} p_l^{k+1} p_s^k $$


The network connectability measure does not provide any indication as to how
many distinct processors and memories are still accessible. We say that a processor
is accessible if it is connected to at least one memory; an accessible memory is defined
similarly. To calculate the number of accessible processors, we obtain the probability
that a given processor is able to connect to any memory. For this calculation, we
again confine ourselves to link failures and assume that switchboxes do not fail.
We can calculate this probability recursively, starting at the output stage. Denote
by $\phi(i)$ the probability that at least one fault-free path exists from a switchbox in
stage i to the output end of the network.

Consider $\phi(0)$. This is the probability that at least one line out of a switchbox at
the output stage is functional: this probability is $1 - q_l^2$.

Consider $\phi(i)$, i > 0. From any switchbox in stage i, we have links to two switchboxes
in stage (i - 1). Consider the top outgoing link. A connection to the output
end exists through this link if and only if that link is functional and the stage-(i - 1)
switchbox that it leads to is connected to the output end. The probability
of this is $p_l \phi(i-1)$. Since the two outgoing links from any switchbox are part of
link-disjoint paths to the output end, the probability of a stage-i switchbox being
disconnected from the output end is $(1 - p_l \phi(i-1))^2$. Hence, the probability that
it is not disconnected is given by

$$ \phi(i) = 1 - (1 - p_l \phi(i-1))^2 $$

The probability that a given processor can connect to the output end is given by
$p_l \phi(k-1)$, because the processor's link leads to a switchbox in the input stage,
which is stage k - 1. Since there are $2^k$ processors, the expected number of accessible
processors that can connect to at least one memory, denoted by $A_c$, is thus

$$ A_c = 2^k p_l \phi(k-1) $$

The butterfly network is symmetric, and so this is also the expression for the expected
number of accessible memories.
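
The two measures can be evaluated numerically as follows. This is a minimal Python sketch with illustrative function names that do not come from the text. It assumes the stages are numbered 0 (output) to k - 1 (input), as in the bandwidth analysis, so the switchbox reached by a processor's link is in stage k - 1.

```python
def connectability(k: int, p_l: float, p_s: float = 1.0) -> float:
    """Expected number of connected processor-memory pairs:
    Q = 2^(2k) * p_l^(k+1) * p_s^k."""
    return 2 ** (2 * k) * p_l ** (k + 1) * p_s ** k


def phi(i: int, p_l: float) -> float:
    """Probability that a stage-i switchbox has at least one fault-free
    path to the output end (link failures only)."""
    value = 1.0 - (1.0 - p_l) ** 2  # phi(0) = 1 - q_l^2
    for _ in range(i):
        value = 1.0 - (1.0 - p_l * value) ** 2
    return value


def accessible_processors(k: int, p_l: float) -> float:
    """Expected number of processors that can reach at least one memory.
    Assumption of this sketch: the input stage is stage k - 1."""
    return 2 ** k * p_l * phi(k - 1, p_l)


if __name__ == "__main__":
    k, p_l = 3, 0.95
    print("Q   =", connectability(k, p_l))
    print("A_c =", accessible_processors(k, p_l))
```

With a single stage (k = 1) the second function reduces to $2 p_l (1 - q_l^2)$: the input link must work and at least one of the two output links must work.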

In this analysis, we have focused on link failures and ignored switchbox failures.
As an exercise, we leave to the reader the task of extending the analysis by
accounting for the possibility of switchbox failures.

Analysis of the Extra-Stage Network

The analysis of the nonredundant network was simplified by the independence
between the two inputs to any switch. The incorporation of redundancy (in the
form of additional switchboxes in the extra stage) into the multistage interconnection
network in Figure 4.3, resulting in two (or more) paths connecting any given
processor-memory pair, introduces dependency among the links. The analysis is
further complicated by the existence of the bypass multiplexers at the input and
output stages. We will therefore not present here the derivation of an expression
for the bandwidth of the extra-stage network. A pointer to such analysis is provided
in the Further Reading section.

The derivation of an expression for the network connectability Q is, however,
relatively simple and will be presented next. As in the previous section, Q is expressed
as the expected number of connectable processor-memory pairs. We first
have to obtain the probability that at least one fault-free path between a given
processor-memory pair exists.

Each processor-memory pair in the extra-stage network is connected by two
disjoint paths (except for both ends), hence

$$ \text{Prob}\{\text{At least one path is fault-free}\} = \text{Prob}\{\text{First path is fault-free}\} + \text{Prob}\{\text{Second path is fault-free}\} - \text{Prob}\{\text{Both paths are fault-free}\} \tag{4.2} $$

This probability can assume one of the following two expressions (see, for example,
the paths connecting processor 0 to memory 0 and the paths connecting
processor 0 to memory 1 in Figure 4.3):

A = (i - 1 - ti) + p \ +2 - pf^i 1 - ti ) 2 

B — 2(1 — tf)p\ +1 ~P? +2 ( 1 - 

where $(1 - q_l^2)$ is the probability that, for a switchbox with a bypass multiplexer,
at least one out of the original horizontal link and its corresponding bypass link is
operational. Since there are $2^{k+1}$ pairs, we can now write

$$ Q = (A + B) 2^{k+1} / 2 = (A + B) 2^k $$


4.2.2 Crossbar Networks 

The structure of a multistage network limits the communication bandwidth between
the inputs and outputs. Even if the processors (connected to the network
inputs) attempt to access different memories (connected to the network outputs),
they sometimes cannot all do so owing to the network's limitations. For example,
if processor 0 (in Figure 4.2) is accessing memory 0, processor 4 is unable to access
any of the memories 1, 2, or 3. A crossbar, shown in Figure 4.5a, offers a higher
bandwidth. As can be seen from Figure 4.5, if there are N inputs and M outputs,
there is one switchbox associated with each of the NM input/output pairings. In
particular, the switchbox in row i and column j is responsible for connecting the
network input on row i to the network output on column j: we call this the (i, j)
switchbox.

FIGURE 4.5 A 3 x 4 crossbar: (a) not fault tolerant; (b) fault tolerant.

Each switchbox is capable of doing the following: 

■ Forward a message incoming from its left link to its right link (i.e., propagate
it along its row).

■ Forward a message incoming from its bottom link to its top link (i.e., propagate
it along its column).

■ Turn a message incoming from its left link to its top link. 

Each link is assumed to be able to carry one message; each switchbox can 
process up to two messages at the same time. For example, a switchbox can be 
forwarding messages from its left to its right link at the same time as it forwards 
messages from its bottom link to its top link. 

The routing strategy is rather obvious. For example, if we want to send a message
from input 3 to output 5, we will proceed as follows. The input will first
arrive at switchbox (3,1), which will forward it to (3,2) and so on, until it reaches
switchbox (3,5). This switchbox will turn the message into column 5 and forward
it to box (2,5), which will send it to box (1,5), which will send it to its destination.
It is easy to see that any input-output combination can be realized as long as
there is no collision at the output (no two inputs are competing for access to the
same output line).

The higher bandwidth that results from this is especially desirable when both
inputs and outputs are connected to high-speed processors, rather than relatively
slow memories. This higher performance comes at a price: as mentioned above, an
N x M crossbar with N inputs and M outputs needs NM switchboxes, whereas an
N x N multistage network (where $N = 2^k$) requires only $(N/2) \log_2 N$ switchboxes
(N/2 switchboxes in each of its $k = \log_2 N$ stages).

It is obvious from Figure 4.5a that the crossbar is not fault tolerant: the failure 
of any switchbox will disconnect certain input-output pairs. Redundancy can be 
introduced to make the crossbar fault tolerant: an example is shown in Figure 4.5b. 
We add a row and a column of switchboxes and augment the input and output 
connections so that each input can be sent to either of two rows, and each output 
can be received on either of two columns. If any switchbox becomes faulty, the 
row and column to which it belongs are retired, and the spare row and column are 
pressed into service. 

The connectability of the crossbar (the original structure and the fault-tolerant
variation) can be analyzed to identify its dependence on the failure probabilities
of the individual components. We demonstrate next the calculation of the connectability
Q of the original crossbar, using the same assumptions and notation
as for the multistage network. We assume that processors are connected to the inputs
and memories to the outputs. As before, assume that $q_l$ is the probability that
a link is faulty, $p_l = 1 - q_l$, and the switchboxes are fault-free. The probability of
switchbox failures can be taken into account, if necessary, by suitably adjusting the
link failure probabilities. Counting from 1, for input i to be connectable to output
j, we have to go through a total of i + j links. The probability that all of them are
fault-free is $p_l^{i+j}$. Hence,

$$ Q = \sum_{i=1}^{N} \sum_{j=1}^{M} p_l^{i+j} = p_l^2 \, \frac{1 - p_l^N}{1 - p_l} \cdot \frac{1 - p_l^M}{1 - p_l} \tag{4.3} $$


Calculating Q for the fault-tolerant crossbar and the bandwidth for both designs 
is more complicated and is left as an exercise for the interested reader. 
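
As a quick numerical check of Equation 4.3, the double sum and the closed form can be compared directly. This is a minimal Python sketch with illustrative function names; it covers only the original (non-fault-tolerant) crossbar with fault-free switchboxes.

```python
def crossbar_q_sum(n_inputs: int, m_outputs: int, p_l: float) -> float:
    """Connectability as the explicit double sum of p_l^(i+j),
    with rows i and columns j counted from 1."""
    return sum(
        p_l ** (i + j)
        for i in range(1, n_inputs + 1)
        for j in range(1, m_outputs + 1)
    )


def crossbar_q_closed(n_inputs: int, m_outputs: int, p_l: float) -> float:
    """Connectability from the closed form (product of two geometric sums)."""
    if p_l == 1.0:
        return float(n_inputs * m_outputs)  # every pair is connected

    def geometric(n: int) -> float:
        # sum_{i=1..n} p_l^i = p_l * (1 - p_l^n) / (1 - p_l)
        return p_l * (1.0 - p_l ** n) / (1.0 - p_l)

    return geometric(n_inputs) * geometric(m_outputs)


if __name__ == "__main__":
    print(crossbar_q_sum(3, 4, 0.95), crossbar_q_closed(3, 4, 0.95))
```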


4.2.3 Rectangular Mesh and Interstitial Mesh 

The multistage and crossbar networks discussed above are examples of networks
constructed out of switchboxes and links and connecting a set of input nodes to
a set of output nodes. A two-dimensional N x M rectangular mesh network is a
simple example of a network topology in which all the nodes are computing nodes
and there are no separate switchboxes (see Figure 4.6). Most of the NM computing
nodes (except the boundary nodes) have four incident links. To send a message to
a node that is not an immediate neighbor, a path from the source of the message to
its destination must be identified and the message has to be forwarded by all the
intermediate nodes along that path.

FIGURE 4.6 A 4 x 6 mesh network.

A conventional two-dimensional rectangular mesh network is unable to tolerate
any faults in any of its nodes without losing the mesh property (that each
internal node has four neighbors). We can introduce redundancy into the network
and provide some tolerance to failures; one approach is shown in Figure 4.7. The
modified mesh includes spare nodes that can be switched in to take the place of
any of their neighbors that have failed. The scheme shown in Figure 4.7 is called
(1,4) interstitial redundancy. In this scheme, each primary node has a single spare
node, while each spare node can serve as a spare for four primary nodes: the redundancy
overhead is 25%. The main advantage of the interstitial redundancy is
the physical proximity of the spare node to the primary node which it replaces,
reducing in this way the delay penalty resulting from the use of a spare.

Another version of interstitial redundancy is shown in Figure 4.8. This is an 
example of a (4,4) interstitial redundancy in which each primary node has four 
spare nodes and each spare node can serve as a spare for four primary nodes. This 
scheme provides a higher level of fault tolerance at the cost of a higher redundancy 
overhead of almost 100%. 

Let us now turn to the reliability of meshes. We will focus on the case in which, 
as mentioned above, nodes are themselves processors engaging in computation, 
in addition to being involved in message-passing. In the context of this dual role 
of processors and switches, reliability no longer means just being able to communicate
from one entry point of the network to another; it means instead the ability
of the mesh, or a subset of it, to maintain its mesh property. 

FIGURE 4.7 A mesh network with (1,4) interstitial redundancy (legend: primary node, spare node).

FIGURE 4.8 A mesh network with (4,4) interstitial redundancy (legend: primary node, spare node).

The algorithms that are executed by mesh-structured computers are often designed
so that their communication structure matches that of the mesh. For example,
an iterative algorithm designed for mesh structures and used to solve the
differential equation (for some function f(x, y))

$$ \frac{\partial^2 f(x,y)}{\partial x^2} + \frac{\partial^2 f(x,y)}{\partial y^2} = 0 $$

requires that each node average the values held by its neighbors. Thus, if the mesh
structure is disrupted, the system will not be able to efficiently carry out such
mesh-structured computations. It is from this point of view that the reliability of
the mesh is defined as the probability that the mesh property is retained.

The reliability of the (1,4) interstitial scheme can be evaluated as follows. Let
R(t) be the reliability of every primary or spare node, and let the mesh be of
size N x M with both N and M even numbers. In such a case, the mesh contains
NM/4 clusters of four primary nodes with a single spare node. The reliability
of a cluster, assuming that all links are fault-free, is

$$ R_{cluster}(t) = R^5(t) + 5R^4(t)\,(1 - R(t)) $$

and the reliability of the N x M interstitial mesh is

$$ R_{IM}(t) = \left[ R^5(t) + 5R^4(t)\,(1 - R(t)) \right]^{NM/4} $$

This should be compared to the reliability of the original N x M mesh, which under
the same assumptions is $R_{mesh}(t) = R^{NM}(t)$. The assumption of fault-free links can
be justified, for example, in the case in which redundancy is added to each link,
making the probability of its failure negligible compared to that of a computing
node.
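
To get a feel for the gain offered by the spares, the two expressions can be compared numerically. This is a minimal Python sketch with illustrative function names; it takes the node reliability R(t) at some fixed time as a plain number and assumes fault-free links, as in the text.

```python
def cluster_reliability(r: float) -> float:
    """Four primary nodes plus one spare: the cluster survives if all
    five nodes work, or if exactly one of the five has failed."""
    return r ** 5 + 5 * r ** 4 * (1 - r)


def interstitial_mesh_reliability(r: float, n: int, m: int) -> float:
    """Reliability of an n x m mesh with (1,4) interstitial redundancy."""
    if n % 2 or m % 2:
        raise ValueError("both mesh dimensions must be even")
    return cluster_reliability(r) ** (n * m // 4)


def plain_mesh_reliability(r: float, n: int, m: int) -> float:
    """Reliability of the original mesh: every node must work."""
    return r ** (n * m)


if __name__ == "__main__":
    for r in (0.99, 0.95, 0.90):
        print(r,
              plain_mesh_reliability(r, 4, 6),
              interstitial_mesh_reliability(r, 4, 6))
```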

Other measures of dependability can be defined for the mesh network (or its
variations). For example, suppose that an application that is about to run on the
mesh requires an n x m submesh for its execution where n < N and m < M. In
this case, the probability of being able to allocate an n x m fault-free submesh out
of the N x M mesh in the presence of faulty nodes is of interest. Unfortunately,
deriving a closed-form expression for this probability is very difficult because of
the need to enumerate all possible positions of a fault-free n x m submesh within
an N x M mesh with faulty nodes. Such an expression can, however, be developed
if the allocation strategy of submeshes is restricted. For example, suppose that only
nonoverlapping submeshes within the mesh can be allocated. This strategy limits
the number of possible allocations to $k = \lfloor N/n \rfloor \times \lfloor M/m \rfloor$ places. This now becomes a
1-of-k system (see Chapter 2), yielding

$$ \text{Prob}\{\text{A fault-free } n \times m \text{ submesh can be allocated}\} = 1 - \left[1 - R^{nm}(t)\right]^k $$

where R(t) is the reliability of a node. If nodes can be repaired, the availability
is the more suitable measure. A Markov chain can be constructed to evaluate the
availability of a node and, consequently, of a certain size submesh.
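
The restricted-allocation formula is simple to evaluate. This is a minimal Python sketch with an illustrative function name; it assumes the nonoverlapping placement strategy described above and takes R(t) at a fixed time as a plain number.

```python
def submesh_allocation_probability(r: float, big_n: int, big_m: int,
                                   n: int, m: int) -> float:
    """Probability that at least one of the k nonoverlapping n x m
    submeshes of an N x M mesh is fault-free (a 1-of-k system)."""
    k = (big_n // n) * (big_m // m)   # number of candidate placements
    submesh_ok = r ** (n * m)         # all n*m nodes of one submesh work
    return 1.0 - (1.0 - submesh_ok) ** k


if __name__ == "__main__":
    # A 4 x 6 mesh offers four nonoverlapping 2 x 3 submeshes.
    print(submesh_allocation_probability(0.9, 4, 6, 2, 3))
```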

4.2.4 Hypercube Network 

A hypercube network of n dimensions, $H_n$, consists of $2^n$ nodes and is constructed
recursively as follows. A zero-dimension hypercube, $H_0$, consists of just a single
node. $H_n$ is constructed by taking two $H_{n-1}$ networks and connecting their corresponding
nodes together. The edges that are added to connect corresponding
nodes in the two $H_{n-1}$ networks are called dimension-(n - 1) edges. Figure 4.9
shows some examples of hypercubes.

A node in a dimension-n hypercube has n edges incident upon it. Sending a
message from one node to another is quite simple if the nodes are named (numbered)
in the following way. When the name is expressed in binary and nodes i
and j are connected by a dimension-k edge, the names of i and j differ in only the
kth-bit position. Thus, we know that because nodes 0000 and 0010 differ in only
bit position 1 (the least significant bit is in position 0), they must be connected by
a dimension-1 edge.

FIGURE 4.9 Hypercubes. Panels: (b) $H_2$, (c) $H_3$, (d) $H_3$, (e) $H_4$; the legend distinguishes edges of dimensions 0, 1, 2, and 3.

This numbering scheme makes routing straightforward. Suppose a message has
to travel from node 14 to node 2 in an $H_4$ network. Because 14 is 1110 in binary
and 2 is 0010, the message will have to traverse one edge each in the dimensions in
which the corresponding bit positions differ, which are dimensions 2 and 3. Thus,
if it first travels from node 1110 on a dimension-3 edge, it arrives at node 0110.
Leaving this node on a dimension-2 edge, the message arrives at its destination,
0010. Clearly, another alternative is to go first on a dimension-2 edge arriving at
1010 and then on a dimension-3 edge to 0010.

More generally, if X and Y are the node addresses of the source and destination
in binary, then the distance between them is the number of bits in which their
addresses differ. Going from X to Y can be accomplished by traveling once along
each dimension in which they differ. More precisely, let $X = x_{n-1} \cdots x_0$ and
$Y = y_{n-1} \cdots y_0$. Define $z_i = x_i \oplus y_i$, where $\oplus$ is the XOR operator. Then, the message
must traverse an edge in every dimension i for which $z_i = 1$. Thus, $Z = z_{n-1} \cdots z_0$
is a routing vector, which specifies which dimension edges have to be traversed in
order to get to the destination.
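
The routing vector maps directly onto a bitwise XOR of the two addresses. This is a minimal Python sketch with illustrative function names; it assumes node addresses are stored as integers and, as one arbitrary choice among the valid orders, traverses the differing dimensions from highest to lowest.

```python
def routing_vector(src: int, dst: int) -> int:
    """Z = X xor Y: bit i is 1 when a dimension-i edge must be traversed."""
    return src ^ dst


def distance(src: int, dst: int) -> int:
    """Number of address bits in which source and destination differ."""
    return bin(src ^ dst).count("1")


def route(src: int, dst: int, n: int) -> list[int]:
    """List the nodes visited when travelling from src to dst in H_n,
    correcting the differing bits from dimension n-1 down to 0."""
    z = routing_vector(src, dst)
    node = src
    path = [node]
    for dim in reversed(range(n)):
        if (z >> dim) & 1:
            node ^= 1 << dim   # cross the dimension-dim edge
            path.append(node)
    return path


if __name__ == "__main__":
    # The example from the text: node 14 (1110) to node 2 (0010) in H_4.
    print([format(v, "04b") for v in route(0b1110, 0b0010, 4)])
```

For the example in the text, the sketch visits 1110, 0110, and 0010, which is the first of the two alternatives described above.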

$H_n$ (for $n \ge 2$) can clearly tolerate link failures because there are multiple paths
from any source to any destination. However, node failures can disrupt the operation
of a hypercube network. Several ways of adding spare nodes to a hypercube
have been proposed. One way is to increase the number of communication ports
of each node from n to (n + 1) and connect these extra ports through additional
links to one or more spare nodes. For example, if two spare nodes are used, each
will serve as a spare for $2^{n-1}$ nodes, which are the nodes in an $H_{n-1}$ subcube. Such
spare nodes may require a large number of ports, namely, $2^{n-1}$. This number of
ports can be reduced by using several crossbar switches, the outputs of which will
be connected to the corresponding spare node. The number of ports of the spare
node can thus be reduced to n + 1, which will also be the degree of all other nodes.
Figure 4.10 shows an $H_4$ hypercube with two spare nodes and with all 18 nodes
having five ports.

Another way of incorporating node redundancy into the hypercube is by duplicating
the processor in a few selected nodes. Each of these additional processors
can serve as a spare, not only for the processor within the same node but also for
any of the processors in the neighboring nodes. For example, nodes 0, 7, 8, and 15
in $H_4$ (see Figure 4.9e) can be modified to duplex nodes so that every node in the
hypercube has a spare at a distance no larger than 1. In this as well as in the previous
redundancy scheme, the replacement of a faulty processor by a spare processor
will result in an additional communication delay that will be experienced by
any node communicating with a spare node.

FIGURE 4.10 A hypercube with spare nodes.

We now show how to calculate the reliability of this network. Assuming that the
nodes and links fail independently of one another, the reliability of the $H_n$ hypercube
is the product of the reliabilities of the $2^n$ nodes and the probability that every
node can communicate with every other node despite possible link failures. Since,
for even moderately large n, multiple paths connect every source-destination pair
in $H_n$, an exact evaluation of the latter probability would require a substantial
enumeration.

Let us instead show how to obtain a good lower bound on the network reliability.
We will start by assuming that the nodes are perfectly reliable: this will allow
us to focus on link failures. Once the network reliability is obtained under this
assumption, we can then introduce node failures by multiplying by the probability
that all the nodes are functional.

Denote by $q_c$ and $q_l$ the probability of a failure (before time t) of a node and
a link, respectively (recall that t is omitted for expression simplicity). Denote the
network reliability of $H_n$ under these conditions by $NR(H_n, q_l, q_c)$. Throughout we
assume that the failures of individual components are independent of one another.

Our lower bound calculation will consist of listing three cases, under each of
which the network is connected. These cases are mutually exclusive; we will add
their probabilities to obtain our lower bound.

Our approach exploits the recursive nature of the hypercube. $H_n$ can be regarded
as two copies of $H_{n-1}$, with corresponding nodes connected by a link. Let
us therefore decompose $H_n$ in this way into two $H_{n-1}$ hypercubes, A and B; $H_n$
consists of these two networks plus $2^{n-1}$ dimension-(n - 1) links (the link dimensions
of $H_n$ are numbered 0 to n - 1). We then consider the following three mutually
exclusive cases, each of which results in a connected $H_n$. Keep in mind that we
are assuming $q_c = 0$ to begin with. Also, when we say that a particular network is
operational, we mean that all its nodes are functional and it is connected.

Case 1. Both A and B are operational and at least one dimension-(n - 1) link is
functional.

$$ \text{Prob}\{\text{Case 1}\} = \left[NR(H_{n-1}, q_l, 0)\right]^2 \left(1 - q_l^{2^{n-1}}\right) $$

Case 2. One of {A, B} is operational and the other is not. All dimension-(n - 1)
links are functional.

$$ \text{Prob}\{\text{Case 2}\} = 2\,NR(H_{n-1}, q_l, 0)\left[1 - NR(H_{n-1}, q_l, 0)\right](1 - q_l)^{2^{n-1}} $$

Case 3. One of {A, B} is operational and the other is not. Exactly one dimension-(n - 1)
link is faulty. This link is connected in the nonoperational $H_{n-1}$ to a node
that has at least one functional link to another node.

$$ \text{Prob}\{\text{Case 3}\} = 2\,NR(H_{n-1}, q_l, 0)\left[1 - NR(H_{n-1}, q_l, 0)\right] \times 2^{n-1} q_l (1 - q_l)^{2^{n-1}-1} \left(1 - q_l^{n-1}\right) $$

In the Exercises, you are asked to show that each of these cases results in a
connected $H_n$ and that the cases are mutually exclusive.

We therefore have

$$ NR(H_n, q_l, 0) = \text{Prob}\{\text{Case 1}\} + \text{Prob}\{\text{Case 2}\} + \text{Prob}\{\text{Case 3}\} $$

The base case is hypercubes of dimension 1: such a system consists of two nodes
and one link, yielding

$$ NR(H_1, q_l, 0) = 1 - q_l $$

We may also start with a hypercube of dimension 2, for which

$$ NR(H_2, q_l, 0) = (1 - q_l)^4 + 4q_l(1 - q_l)^3 $$

Finally, we consider the case $q_c \ne 0$. From the definition of network reliability, it
follows immediately that

$$ NR(H_n, q_l, q_c) = (1 - q_c)^{2^n} NR(H_n, q_l, 0) \tag{4.4} $$
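
The three-case recursion can be evaluated iteratively. This is a minimal Python sketch with illustrative function names, not a definitive implementation: it assumes the recursion is started from the dimension-2 base case (with dimension 1 handled separately) and that the number of dimension-(n - 1) links is $2^{n-1}$.

```python
def nr_links_only(n: int, q_l: float) -> float:
    """Estimate of NR(H_n, q_l, 0): nodes are perfect, links fail
    independently with probability q_l."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return 1.0 - q_l                                  # two nodes, one link
    nr = (1 - q_l) ** 4 + 4 * q_l * (1 - q_l) ** 3        # H_2 base case
    for dim in range(3, n + 1):
        links = 2 ** (dim - 1)            # dimension-(dim-1) links joining A and B
        one_half_up = 2 * nr * (1 - nr)   # exactly one of A, B operational
        case1 = nr * nr * (1 - q_l ** links)
        case2 = one_half_up * (1 - q_l) ** links
        case3 = (one_half_up * links * q_l * (1 - q_l) ** (links - 1)
                 * (1 - q_l ** (dim - 1)))
        nr = case1 + case2 + case3
    return nr


def network_reliability(n: int, q_l: float, q_c: float) -> float:
    """Equation 4.4: multiply by the probability that all 2^n nodes work."""
    return (1 - q_c) ** (2 ** n) * nr_links_only(n, q_l)


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, network_reliability(n, 0.05, 0.01))
```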


4.2.5 Cube-Connected Cycles Networks 

The hypercube topology has multiple paths between nodes and a low overall diameter
of n for a network of $2^n$ nodes. However, these are achieved at the price
of a high node degree. A node must have n ports, which implies that a new node
design is required whenever the size of the network increases. An alternative is
the Cube-Connected Cycles (CCC) which keeps the degree of a node fixed at three
or less. A CCC network that corresponds to the $H_3$ hypercube (see Figure 4.9d) is
shown in Figure 4.11. Each node of degree three in $H_3$ is replaced by a cycle consisting
of three nodes. In general, each node of degree n in the hypercube $H_n$ is
replaced by a cycle containing n nodes where the degree of every node in the cycle
is 3. The resulting CCC(n,n) network has $n2^n$ nodes. In principle, each cycle may
include k nodes with $k \ge n$ with the additional k - n nodes having a degree of 2.
This will yield a CCC(n,k) network with $k2^n$ nodes. The extra nodes of degree 2
have a very small impact on the properties that are of interest to us, and we will
therefore restrict ourselves to the case k = n.

FIGURE 4.11 A CCC(3,3) (cube-connected cycles) network.

By extending the labeling scheme of the hypercube, we can represent each node
of the CCC by (i; j), where i (an n-bit binary number) is the label of the node in the
hypercube that corresponds to the cycle and j ($0 \le j \le n-1$) is the position of the
node within the cycle. Two nodes, (i; j) and (i'; j'), are linked by an edge in the CCC
if and only if either

1. i = i' and $j - j' \equiv \pm 1 \pmod n$, or

2. j = j' and i differs from i' in precisely the jth bit.

The former case is a link along the cycle and the latter corresponds to the
dimension-j edge in the hypercube. For example, nodes 0 and 2 in $H_3$ (see Figure 4.9d)
are connected through a dimension-1 edge that corresponds to the edge
connecting nodes (0,1) and (2,1) in Figure 4.11.
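
The two adjacency rules translate into a short neighbor function. This is a minimal Python sketch for the case k = n with $n \ge 3$; the function names are illustrative, and nodes are represented as (i, j) tuples with i stored as an integer.

```python
def ccc_nodes(n: int) -> list[tuple[int, int]]:
    """All n * 2^n nodes (i, j) of CCC(n, n)."""
    return [(i, j) for i in range(2 ** n) for j in range(n)]


def ccc_neighbors(i: int, j: int, n: int) -> set[tuple[int, int]]:
    """Neighbors of node (i, j): two along the cycle (rule 1) and one
    across the dimension-j hypercube edge (rule 2)."""
    return {
        (i, (j + 1) % n),     # next position in the same cycle
        (i, (j - 1) % n),     # previous position in the same cycle
        (i ^ (1 << j), j),    # same position, cycle differing in bit j
    }


if __name__ == "__main__":
    n = 3
    print(len(ccc_nodes(n)), "nodes")
    print(sorted(ccc_neighbors(0, 1, n)))
```

For n = 3 the neighbor set of node (0, 1) contains (2, 1), matching the example edge in the text.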

The lower degree of nodes in the CCC compared to the hypercube results in a
bigger diameter. Instead of a diameter of size n for the hypercube, the diameter of
the CCC(n,n) is

$$ 2n + \left\lfloor \frac{n}{2} \right\rfloor - 2 \approx 2.5n $$

The routing of messages in the CCC is also more complicated than that in hypercubes
(discussed in Section 4.3.1). The fault tolerance of the CCC is, however,
higher because the failure of a single node in the CCC will only have an effect similar
to that of a single faulty link in the hypercube. A closed form expression for
the reliability of the CCC has not yet been derived.

FIGURE 4.12 A 15-node chordal network with a skip distance of 3.


4.2.6 Loop Networks 

The cycle topology (also called loop network) that is replicated in the CCC network 
can serve as an interconnection network with the desirable properties of a simple 
routing algorithm and a small node degree. However, an n-node loop with all its 
edges unidirectional has a diameter of n — 1, which means that a message from 
one node to the other will, on the average, have to be relayed by n/2 intermediate 
nodes. Moreover, a unidirectional loop network is not fault tolerant; a single node 
or link failure will disconnect the network. 

To reduce the diameter and improve the fault tolerance of the loop network,
extra links can be added. These extra links are called chords, and one way of adding
these unidirectional chords is shown in Figure 4.12. Each node in such a chordal
network has an additional backward link connecting it to a node at a distance s,
called the skip distance. Thus, node i ($0 \le i \le n-1$) has a forward link to node
(i + 1) mod n and a backward link to node (i - s) mod n. The degree of every node
in this chordal network is 4 for any value of n.

Different topologies can be obtained by varying the value of s, and we can select
s so that the diameter of the network is minimized. To this end, we need an
expression for the diameter, denoted by D, as a function of the skip distance s.
The diameter is the longest distance that a message must traverse from a source
node i to a destination node j: it obviously depends on the routing scheme that is
being used. Suppose we use a routing scheme that attempts to reduce the length
of the path between i and j by using the backward chords (that allow skipping of
intermediate nodes) as long as this is advantageous. If we denote by b the number
of backward chords that are being used, then the number of nodes skipped is
bs. If the maximum value of b, denoted by b', is reached, then the use of an additional
backward chord will take us back to the source i (or even further). Thus, b'
should satisfy $b's + b' \le n$. To calculate the diameter D, we therefore use b' backward
chords, where

$$ b' = \left\lfloor \frac{n}{s+1} \right\rfloor $$

To these b' links, we may need to add a maximum of s - 1 forward links, and thus,

$$ D = \left\lfloor \frac{n}{s+1} \right\rfloor + (s - 1) \tag{4.5} $$

We wish now to find a value of s that will yield a minimal D. Depending upon the
value of n, there may exist several values of s that minimize D. The value $s = \lfloor \sqrt{n} \rfloor$
is optimal for most values of n, yielding $D_{opt} \approx 2\sqrt{n} - 1$. For example, if n = 15 as
in Figure 4.12, the optimal s that minimizes the diameter D is $s = \lfloor \sqrt{15} \rfloor = 3$ (the
value that is used in the figure). The corresponding diameter is $D = \lfloor 15/4 \rfloor + 2 = 5$.

Analyzing the improvement in the reliability/fault tolerance of the loop network
as a result of the extra chords is quite complicated. We can instead calculate
the number of paths between the two farthest nodes in the network. If this number
is maximized, it is likely that the reliability is close to optimal. We focus on
the paths that are of the same length and consist of b' backward chords and (s - 1)
forward links but use the backward chords and forward links in a different order.
The number of such paths is

$$ \binom{b' + s - 1}{s - 1} $$

If we search for a value of s that will maximize the number of alternative paths of
the minimum length between the two farthest nodes, we get $s = \lceil \sqrt{n} \rceil$. However,
for most values of n, $s = \lfloor \sqrt{n} \rfloor$ also yields the same number of paths. In summary,
we conclude that in most cases, the value of s that minimizes the diameter also
maximizes the number of alternate paths and thus improves the reliability of the
network.

FIGURE 4.13 A four-node network.
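
Both quantities used in this section, the diameter of Equation 4.5 and the number of equal-length paths between the two farthest nodes, are easy to tabulate for a given n. This is a minimal Python sketch with illustrative function names; it simply evaluates the two formulas and searches over all skip distances from 1 to n - 1.

```python
import math


def chordal_diameter(n: int, s: int) -> int:
    """Equation 4.5: D = floor(n / (s + 1)) + (s - 1)."""
    return n // (s + 1) + (s - 1)


def best_skip_distances(n: int) -> tuple[int, list[int]]:
    """Smallest diameter and every skip distance s that attains it."""
    diameters = {s: chordal_diameter(n, s) for s in range(1, n)}
    best = min(diameters.values())
    return best, [s for s, d in diameters.items() if d == best]


def farthest_pair_path_count(n: int, s: int) -> int:
    """Number of orderings of b' backward chords and s - 1 forward links."""
    b_max = n // (s + 1)
    return math.comb(b_max + s - 1, s - 1)


if __name__ == "__main__":
    n = 15
    print(best_skip_distances(n))
    for s in range(1, 7):
        print(s, chordal_diameter(n, s), farthest_pair_path_count(n, s))
```

For n = 15 the search returns a minimum diameter of 5, attained only at s = 3, in agreement with the example in the text.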

4.2.7 Ad Hoc Point-to-Point Networks 

The interconnection networks that we have considered so far have regular structures
and the resulting symmetry greatly simplified the analysis of their resilience.
The computing nodes in a distributed computer system are quite often interconnected
through a network that has no regular structure. Such interconnection networks,
also called point-to-point networks, have typically more than a single path
between any two nodes, and are therefore inherently fault tolerant. For this type
of network, we would like to be able to calculate the path reliability, defined as the
probability that there exists an operational path between two specific nodes, given
the various link failure probabilities.


■ EXAMPLE 

Figure 4.13 shows a network of five directed links connecting four nodes.
We are interested in calculating the path reliability for the source-destination
pair $N_1$-$N_4$. The network includes three paths from $N_1$ to $N_4$, namely,
$P_1 = \{x_{1,2}, x_{2,4}\}$, $P_2 = \{x_{1,3}, x_{3,4}\}$, and $P_3 = \{x_{1,2}, x_{2,3}, x_{3,4}\}$. Let $p_{i,j}$ denote the
probability that link $x_{i,j}$ is operational and define $q_{i,j} = 1 - p_{i,j}$. (Here too we
omit the dependence on time to simplify the notation.) We assume that the
nodes are fault-free; if the nodes can fail, we incorporate their probability of
failure into the failure probability of the outgoing links. Clearly, for a path
from $N_1$ to $N_4$ to exist, at least one of $P_1$, $P_2$, or $P_3$ must be operational.
We may not, however, add the three probabilities Prob{$P_i$ is operational},
because some events will be counted more than once. The key to calculating
the path reliability is to construct a set of disjoint (or mutually exclusive) events
and then add up their probabilities. For this example, the disjoint events that
allow $N_1$ to send a message to $N_4$ are (a) $P_1$ is up, (b) $P_2$ is up but $P_1$ is down,
and (c) $P_3$ is up but both $P_1$ and $P_2$ are down. The path reliability is thus

$$R_{N_1,N_4} = p_{1,2}p_{2,4} + p_{1,3}p_{3,4}[1 - p_{1,2}p_{2,4}] + p_{1,2}p_{2,3}p_{3,4}[q_{1,3}q_{2,4}]$$
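The three disjoint events (a), (b), and (c) of this example can be checked numerically. The following is a minimal sketch, not part of the original text: it sums the three events and compares the result with a brute-force sum over all 2^5 link states. The link probabilities are made-up illustration values, and the function names are hypothetical.

```python
from itertools import product

# Directed links of the four-node network in Figure 4.13, keyed by (i, j).
# The probabilities below are made-up illustration values, not from the text.
p = {(1, 2): 0.9, (1, 3): 0.8, (2, 3): 0.7, (2, 4): 0.95, (3, 4): 0.85}
q = {link: 1.0 - prob for link, prob in p.items()}

PATHS = [
    [(1, 2), (2, 4)],          # P1
    [(1, 3), (3, 4)],          # P2
    [(1, 2), (2, 3), (3, 4)],  # P3
]

def reliability_disjoint(p, q):
    """Sum of the three disjoint events (a), (b), (c) from the example."""
    a = p[1, 2] * p[2, 4]
    b = p[1, 3] * p[3, 4] * (1 - p[1, 2] * p[2, 4])
    c = p[1, 2] * p[2, 3] * p[3, 4] * q[1, 3] * q[2, 4]
    return a + b + c

def reliability_enumerated(p, paths):
    """Brute force: add the probability of every link state with a live path."""
    links = sorted(p)
    total = 0.0
    for state in product([True, False], repeat=len(links)):
        up = dict(zip(links, state))
        if any(all(up[link] for link in path) for path in paths):
            prob = 1.0
            for link in links:
                prob *= p[link] if up[link] else 1.0 - p[link]
            total += prob
    return total

r_disjoint = reliability_disjoint(p, q)
r_enum = reliability_enumerated(p, PATHS)
print(round(r_disjoint, 6), round(r_enum, 6))
```

The two printed values should agree, because the three events partition the set of link states in which at least one path is operational.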


For this simple network, it is relatively easy to identify the links that must be 
faulty so that the considered paths are down and the events become disjoint. In 
the general case, however, the identification of such links can be very complicated, 
and using the inclusion and exclusion probability formula, detailed next, becomes 
necessary. 

Suppose that for a given source-destination pair, say $N_s$ and $N_d$, $m$ paths
$P_1, P_2, \ldots, P_m$ exist from the source to the destination. Denote by $E_i$ the event in
which path $P_i$ is operational. The expression for the path reliability is

$$R_{N_s,N_d} = \mathrm{Prob}\{E_1 \cup E_2 \cup \cdots \cup E_m\} \tag{4.6}$$

The events $E_1, \ldots, E_m$ are not disjoint, but they can be decomposed into a set of
disjoint events as follows:

$$E_1 \cup E_2 \cup \cdots \cup E_m = E_1 \cup (E_2 \cap \bar{E}_1) \cup (E_3 \cap \bar{E}_1 \cap \bar{E}_2) \cup \cdots \cup (E_m \cap \bar{E}_1 \cap \bar{E}_2 \cap \cdots \cap \bar{E}_{m-1}) \tag{4.7}$$

where $\bar{E}_i$ denotes the event that path $P_i$ is faulty. The events on the right hand side
of Equation 4.7 are disjoint, and their probabilities can therefore be added to yield
the path reliability:

$$R_{N_s,N_d} = \mathrm{Prob}\{E_1\} + \mathrm{Prob}\{E_2 \cap \bar{E}_1\} + \cdots + \mathrm{Prob}\{E_m \cap \bar{E}_1 \cap \bar{E}_2 \cap \cdots \cap \bar{E}_{m-1}\} \tag{4.8}$$

This expression can be rewritten using conditional probabilities:

$$R_{N_s,N_d} = \mathrm{Prob}\{E_1\} + \mathrm{Prob}\{E_2\}\,\mathrm{Prob}\{\bar{E}_1 \mid E_2\} + \cdots + \mathrm{Prob}\{E_m\}\,\mathrm{Prob}\{\bar{E}_1 \cap \bar{E}_2 \cap \cdots \cap \bar{E}_{m-1} \mid E_m\} \tag{4.9}$$

The probabilities $\mathrm{Prob}\{E_i\}$ are easily calculated. The difficulty is in calculating the
probabilities $\mathrm{Prob}\{\bar{E}_1 \cap \cdots \cap \bar{E}_{i-1} \mid E_i\}$. We can rewrite the latter as
$\mathrm{Prob}\{\bar{E}_{1|i} \cap \cdots \cap \bar{E}_{i-1|i}\}$, where $\bar{E}_{j|i}$ is the event in which $P_j$ is faulty given that $P_i$ is operational.
To identify the links that must fail so that the event $\bar{E}_{j|i}$ occurs, we define the
conditional set

$$P_{j|i} = P_j - P_i = \{x_k \mid x_k \in P_j \text{ and } x_k \notin P_i\}$$

We will illustrate the use of these equations through the following example.

FIGURE 4.14 A six-node network.


■ EXAMPLE 

The six-node network shown in Figure 4.14 has nine links, of which six are
unidirectional and three bidirectional. We are interested in calculating the path
reliability for the pair $N_1$-$N_6$. The list of paths leading from $N_1$ to $N_6$ includes
the following:

$P_1 = \{x_{1,3}, x_{3,5}, x_{5,6}\}$
$P_2 = \{x_{1,2}, x_{2,5}, x_{5,6}\}$
$P_3 = \{x_{1,2}, x_{2,4}, x_{4,6}\}$
$P_4 = \{x_{1,3}, x_{3,5}, x_{4,5}, x_{4,6}\}$
$P_5 = \{x_{1,3}, x_{2,3}, x_{2,4}, x_{4,6}\}$
$P_6 = \{x_{1,3}, x_{2,3}, x_{2,5}, x_{5,6}\}$
$P_7 = \{x_{1,2}, x_{2,5}, x_{4,5}, x_{4,6}\}$
$P_8 = \{x_{1,2}, x_{2,3}, x_{3,5}, x_{5,6}\}$
$P_9 = \{x_{1,2}, x_{2,4}, x_{4,5}, x_{5,6}\}$
$P_{10} = \{x_{1,3}, x_{2,3}, x_{2,4}, x_{4,5}, x_{5,6}\}$
$P_{11} = \{x_{1,3}, x_{2,3}, x_{2,5}, x_{4,5}, x_{4,6}\}$
$P_{12} = \{x_{1,3}, x_{3,5}, x_{2,5}, x_{2,4}, x_{4,6}\}$
$P_{13} = \{x_{1,2}, x_{2,3}, x_{3,5}, x_{4,5}, x_{4,6}\}$

Note that these paths are ordered so that the shortest ones are at the top and
the longest ones at the bottom. This simplifies the calculation of the path
reliability, as will become apparent next.

The conditional set $P_{1|2}$ is $P_{1|2} = P_1 - P_2 = \{x_{1,3}, x_{3,5}\}$. The set $\{x_{1,3}, x_{3,5}\}$
must fail (that is, at least one of its links must be faulty) in order for $P_1$ to be
faulty while $P_2$ is working. The second term in Equation 4.9, corresponding to $P_2$,
will thus be $p_{1,2}p_{2,5}p_{5,6}(1 - p_{1,3}p_{3,5})$.

For calculating the other terms in Equation 4.9, the intersection of several
conditional sets must be considered. For example, for $P_4$ the conditional sets
are $P_{1|4} = \{x_{5,6}\}$, $P_{2|4} = \{x_{1,2}, x_{2,5}, x_{5,6}\}$, and $P_{3|4} = \{x_{1,2}, x_{2,4}\}$. Because $P_{2|4}$ will
fail when $P_{1|4}$ fails, we can discard $P_{2|4}$ and focus on $P_{1|4}$ and $P_{3|4}$. Both $P_1$
and $P_3$ must be faulty while $P_4$ is working. The fourth term in Equation 4.9,
corresponding to $P_4$, will therefore be $p_{1,3}p_{3,5}p_{4,5}p_{4,6}(1 - p_{5,6})(1 - p_{1,2}p_{2,4})$.

A more complicated situation is encountered when calculating the third
term in Equation 4.9 for $P_3$. Here, $P_{1|3} = \{x_{1,3}, x_{3,5}, x_{5,6}\}$, $P_{2|3} = \{x_{2,5}, x_{5,6}\}$, and
the two conditional sets are not disjoint. Both $P_1$ and $P_2$ will be faulty if one
of the following disjoint events occurs: (1) $x_{5,6}$ is faulty, (2) $x_{5,6}$ is working
and either $x_{1,3}$ is faulty and $x_{2,5}$ is faulty, or $x_{1,3}$ is working, $x_{3,5}$ is faulty
and $x_{2,5}$ is faulty. The resulting expression is
$p_{1,2}p_{2,4}p_{4,6}[q_{5,6} + p_{5,6}q_{1,3}q_{2,5} + p_{5,6}p_{1,3}q_{3,5}q_{2,5}]$. The remaining terms in Equation 4.9 are similarly calculated
and the sum of all 13 terms yields the required path reliability, $R_{N_1,N_6}$. ■
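The conditional sets used in this example are plain set differences, so they are easy to compute mechanically. The sketch below is illustrative only: the paths $P_1$ to $P_4$ are copied from the example, links are represented as (i, j) tuples, and the helper names `conditional_set` and `reduced_conditional_sets` are hypothetical.

```python
# Paths P1..P4 of the six-node example, as sets of link labels (i, j).
P = {
    1: {(1, 3), (3, 5), (5, 6)},
    2: {(1, 2), (2, 5), (5, 6)},
    3: {(1, 2), (2, 4), (4, 6)},
    4: {(1, 3), (3, 5), (4, 5), (4, 6)},
}

def conditional_set(j, i):
    """P_{j|i}: the links of P_j that are not in P_i."""
    return P[j] - P[i]

def reduced_conditional_sets(i):
    """Conditional sets of all earlier paths with respect to P_i.

    A set that strictly contains another one is dropped, because it
    fails automatically whenever the smaller set fails.
    """
    sets = {j: conditional_set(j, i) for j in P if j < i}
    return {j: s for j, s in sets.items()
            if not any(t < s for t in sets.values())}

print(conditional_set(1, 2))
print(reduced_conditional_sets(4))
print(reduced_conditional_sets(3))
```

For $P_4$ the reduction discards $P_{2|4}$, as in the text. For $P_3$ both conditional sets survive and share the link $x_{5,6}$, which is why that term needs the case analysis given in the example.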


The alert reader would have noticed the similarity between the calculation of
the path reliability and the computation of the availability for a given set of read
and write quorums in a distributed system with data replication that has been
presented in Section 3.3. Here too, we have a number of components (links), each
of which can be up or down and we need to calculate the probability that certain
combinations of such components are up. In the last example we had nine links
and we can enumerate all $2^9$ states and calculate the probability of each state by
multiplying nine factors of the form $p_{i,j}$ or $q_{i,j}$. We then add up the probabilities of
all the states in which a path from node $N_1$ to node $N_6$ exists and thereby obtain
the path reliability $R_{N_1,N_6}$.
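The state enumeration described above is short enough to write out. The following sketch is an illustration under stated assumptions: it treats the 13 paths listed in the example as the complete set of paths (so a state is counted when at least one listed path has all its links up), and the function names and the uniform link probability of 0.9 are hypothetical. It also checks the third term of the example against the enumeration.

```python
from itertools import product

# The 13 paths of the six-node example, as sets of link labels (i, j).
PATHS = [
    {(1, 3), (3, 5), (5, 6)}, {(1, 2), (2, 5), (5, 6)}, {(1, 2), (2, 4), (4, 6)},
    {(1, 3), (3, 5), (4, 5), (4, 6)}, {(1, 3), (2, 3), (2, 4), (4, 6)},
    {(1, 3), (2, 3), (2, 5), (5, 6)}, {(1, 2), (2, 5), (4, 5), (4, 6)},
    {(1, 2), (2, 3), (3, 5), (5, 6)}, {(1, 2), (2, 4), (4, 5), (5, 6)},
    {(1, 3), (2, 3), (2, 4), (4, 5), (5, 6)}, {(1, 3), (2, 3), (2, 5), (4, 5), (4, 6)},
    {(1, 3), (3, 5), (2, 5), (2, 4), (4, 6)}, {(1, 2), (2, 3), (3, 5), (4, 5), (4, 6)},
]
LINKS = sorted(set().union(*PATHS))  # the nine links

def link_states(p):
    """Yield (set of working links, probability) for all 2**9 link states."""
    for state in product([True, False], repeat=len(LINKS)):
        up = {link for link, is_up in zip(LINKS, state) if is_up}
        prob = 1.0
        for link in LINKS:
            prob *= p[link] if link in up else 1.0 - p[link]
        yield up, prob

def path_reliability(p):
    """Add the probabilities of all states in which some listed path is up."""
    return sum(prob for up, prob in link_states(p)
               if any(path <= up for path in PATHS))

def third_term_enumerated(p):
    """Prob{P3 up, P1 down, P2 down}, obtained by enumeration."""
    return sum(prob for up, prob in link_states(p)
               if PATHS[2] <= up and not PATHS[0] <= up and not PATHS[1] <= up)

def third_term_formula(p):
    """The closed-form third term given in the example."""
    q = {link: 1.0 - v for link, v in p.items()}
    return p[1, 2] * p[2, 4] * p[4, 6] * (
        q[5, 6] + p[5, 6] * q[1, 3] * q[2, 5]
        + p[5, 6] * p[1, 3] * q[3, 5] * q[2, 5])

p = {link: 0.9 for link in LINKS}   # illustrative value, not from the text
print(len(LINKS), path_reliability(p))
print(third_term_enumerated(p), third_term_formula(p))
```

Enumeration costs $2^9 = 512$ state evaluations here, which is trivial, but the count doubles with every added link, so this approach is practical only for small networks.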


4.3 Fault-Tolerant Routing 

The objective of a fault-tolerant routing strategy is to get a message from source to 
destination despite a subset of the network being faulty. The basic idea is simple: 
if no shortest or most convenient path is available because of link or node failures, 
reroute the message through other paths to its destination. 

The implementation of fault tolerance depends on the nature of the routing
algorithm. In this section, we will focus on unicast routing in distributed computing.
In a unicast, a message is sent from a source to just one destination. The problem
of multicast, in which copies of a message are sent to a number of nodes, is an
extension of the unicast problem.

Routing algorithms can be either centralized or distributed. Centralized routing 
involves having a central controller in the network, which is aware of the current 
network state (which links or nodes are up and which are down; which links are 
heavily congested) and lays out for each message the path it must take. A variation 
on this is to have the source act as the controller for that message and specify its 
route. In distributed routing, there is no central controller: the message is passed 
from node to node, and each intermediate node decides which node to send it to 
next. 

The route can be chosen either uniquely or adaptively. In the former approach,
just one path can be taken for each source-destination pair. For instance, in a
rectangular mesh, the message can move in two dimensions: horizontal and vertical.
The rule may be that the message has to move along the horizontal dimension
until it is in the same column as the destination node, whereupon (if it is not
already at the destination) it turns and moves vertically to reach the destination. In
an adaptive approach, the path can be varied in response to network conditions.
For instance, if a particular link is congested, the routing policy may avoid using
it if at all possible.
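The unique horizontal-then-vertical rule for the mesh can be sketched in a few lines. This is an illustrative sketch only: nodes are assumed to be addressed by hypothetical (column, row) integer pairs, and the function name `xy_route` is not from the text.

```python
def xy_route(src, dst):
    """Return the nodes visited from src to dst, both (column, row) pairs.

    The message moves horizontally until it reaches the destination
    column, then vertically until it reaches the destination row.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))
```

Because the rule fixes exactly one path per source-destination pair, a single faulty link or node on that path blocks the message, which is what motivates the adaptive schemes discussed next.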

Implementing fault tolerance in centralized routing is not difficult. A centralized
router that knows the state of each link can use graph-theoretic algorithms
to determine one or more paths that may exist from source to destination. Out of
these, some secondary considerations (such as load balancing or number of hops)
can be used to select the path to be followed.
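As one possible illustration of such a graph-theoretic algorithm, a centralized router could run a breadth-first search over the links it knows to be up, which yields a path with the fewest hops. The sketch below is an assumption-laden example, not a method prescribed by the text: the link-state table format and the function name `shortest_operational_path` are hypothetical.

```python
from collections import deque

def shortest_operational_path(links, source, destination):
    """Breadth-first search over working links.

    links: dict mapping a directed link (u, v) to True (up) or False (down).
    Returns a minimum-hop list of nodes, or None if no path exists.
    """
    neighbors = {}
    for (u, v), is_up in links.items():
        if is_up:
            neighbors.setdefault(u, []).append(v)
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == destination:
            path = []
            while node is not None:   # walk the parent chain back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in neighbors.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

# Link-state table for the four-node network of Figure 4.13, with link (2, 4) down.
state = {(1, 2): True, (1, 3): True, (2, 3): True, (2, 4): False, (3, 4): True}
print(shortest_operational_path(state, 1, 4))
```

Other secondary criteria, such as load balancing, would need link weights and a different search, for example a shortest-path algorithm over congestion costs.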

In the rest of this section, we present routing approaches for two of the structures
we have encountered before: the $n$-dimensional hypercube and the rectangular
mesh.

4.3.1 Hypercube Fault-Tolerant Routing 

Although the hypercube network can tolerate link failures, we still must modify 
the routing algorithm so that it continues to successfully route messages in injured 
hypercubes, i.e., hypercubes with some faulty nodes or links. The basic idea is to 
list the dimensions along which the message must travel and then traverse them 
one by one. As edges are traversed, they are crossed off the list. If, because of a link 
or a node failure, the desired link is not available, then another edge in the list, if 
any, is chosen for traversal. If no such edges are available (the message arrives at 
some node to find that all dimensions on its list are down), it backtracks to the 
previous node and tries again. 

Before writing out the algorithm, we introduce some notation. Let $TD$ denote
the list of dimensions that the message has already traveled on, in the order in
which they have been traversed. $TD^R$ is the list $TD$ reversed. $\bigoplus_{i=1}^{k}$ denotes the
XOR operation carried out $k$ times, sequentially. For example, $\bigoplus_{i=1}^{3} a_i$ means
$(a_1 \oplus a_2) \oplus a_3$. If $D$ is the destination and $S$ the source, let $d = D \oplus S$, where $\oplus$ is a
bitwise XOR operation on $D$ and $S$. In general, $x \oplus y$ is called the relative address of
node $x$ with respect to node $y$. Let $SR(A)$ be the set of relative addresses reachable
by traversing each of the dimensions listed in $A$, in that order. For example, if we
travel along dimensions 1, 3, 2 in a four-dimensional hypercube, the set of relative
addresses reachable by this travel would be: 0010, 1010, 1110. Denote by $e^i_n$ the $n$-bit
vector consisting of a 1 in the $i$th-bit position and 0 everywhere else, for example,
$e^1_3 = 010$.
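The notation above maps directly onto integer bit operations. The following sketch is illustrative; the function names are hypothetical, node addresses are held as Python integers, and bit position 0 is taken to be the rightmost bit, which is consistent with the 1, 3, 2 example in the text.

```python
def relative_address(x, y):
    """Relative address of node x with respect to node y (bitwise XOR)."""
    return x ^ y

def unit_vector(i):
    """Integer form of e^i_n: a single 1 in bit position i (bit 0 is rightmost)."""
    return 1 << i

def sr(dimensions):
    """SR(A): relative addresses reached by traversing the listed dimensions in order."""
    reached, current = [], 0
    for dim in dimensions:
        current ^= unit_vector(dim)
        reached.append(current)
    return reached

def as_bits(value, n):
    """Format an address as an n-bit binary string."""
    return format(value, "0{}b".format(n))

print([as_bits(a, 4) for a in sr([1, 3, 2])])
print(as_bits(unit_vector(1), 3))
```

Traversing dimension $j$ simply flips bit $j$ of the current address, so the relative address $d$ tells a node exactly which dimensions remain to be crossed.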

Messages are assumed to consist of (a) d: the list of dimensions that must be 
traversed from S to D, (b) the data being transmitted (the "payload"), and (c) TD: 
the list of dimensions taken so far. 

By TRANSMIT($j$) we mean "send the message ($d \oplus e^j_n$, payload, $TD \odot j$) along
the $j$th-dimensional link from the present node," where $\odot$ denotes the "append"
operation (e.g., $TD \odot x$ means "append $x$ to the list $TD$").

    if (d == 0...0)
        Accept message and Exit algorithm    // Final destination has been reached.
    else
        for j = 0 to (n - 1) step 1 do {
            if ((d_j == 1) && (jth dimension link from this node is nonfaulty)
                && (e_n^j ∉ SR(TD^R))) {     // Message gets one step closer to its destination.
                TRANSMIT(j)
                Exit algorithm
            }
        }
    end if
    // If we are not done at this point, it means there is no way of getting one
    // step closer to the destination from this node: we need to take a detour.
    if (there is a non-faulty link not in SR(TD^R))    // There is a link not yet attempted.
        Let h be one such link
    else {
        Define g = max{m : ⊕_{i=1}^{m} e_n^{TD^R(i)} == 0...0}
        if (g == number of elements in SR(TD)) {
            Give up    // Network is disconnected and no path exists to destination.
            Exit algorithm
        }
        else
            h = element (g + 1) in TD^R    // Prepare to backtrack.
        end if
    }
    end if
    TRANSMIT(h)
    end

FIGURE 4.15 Algorithm for routing in hypercubes.


The algorithm is shown in Figure 4.15. When node V receives a message, the
algorithm checks to see if V is its intended destination. If so, it accepts the message,
and the message's journey is over. If V was not the intended final destination, the
algorithm checks if the message can be forwarded so that it is one hop (or,
equivalently, one dimension) closer to its destination. If this is possible, the message is
forwarded along the chosen link. If not, we need to take a detour. To take a detour,
we see if there is a link that this message has not yet traversed from V. If so, we
send it along such a link (any such link will do: we are trying to move the
message to some other node closer to the destination). If the message has traversed
every such link, we need to backtrack and send the message back to the node from
which V originally received it. If V happens to be the source node itself, then it
means that the hypercube is disconnected and there is no path from the source to
the destination.

FIGURE 4.16 Routing in an injured hypercube.


■ EXAMPLE 

We are given an $H_3$ with faulty node 011 (see Figure 4.16). Suppose node
$S = 000$ wants to send a message to $D = 111$. At 000, $d = 111$, so it sends the
message out on dimension-0, to node 001. At node 001, $d = 110$ and $TD = (0)$.
This node attempts to send it out on its dimension-1 edge. However, because
node 011 is down, it cannot do so. Since bit 2 of $d$ is also 1, it checks and finds
that the dimension-2 edge to 101 is available. The message is now sent to 101,
from which it makes its way to 111. What if both 011 and 101 had been down?
We invite the reader to solve this problem. ■


How can we be confident that this algorithm will, in fact, find a way of getting 
the message to its destination (so long as a source-to-destination path exists)? The 
answer is that this algorithm implements a depth-first search strategy for graphs, 
and such strategies have been shown to be effective in finding a path if one exists. 
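To make the depth-first idea concrete, the sketch below routes a message through an injured hypercube. It is a simplified illustration and not a transcription of Figure 4.15: instead of carrying the list $TD$ in the message, it keeps a set of visited nodes and an explicit stack, which is an assumption made here for brevity. The function name `route` and its parameters are hypothetical.

```python
def route(n, source, dest, faulty_nodes=(), faulty_links=()):
    """Depth-first routing sketch in an n-dimensional hypercube.

    Returns the sequence of nodes the message visits (including
    backtracking hops), or None if the destination is unreachable.
    """
    faulty_nodes = set(faulty_nodes)
    faulty_links = {frozenset(link) for link in faulty_links}

    def usable(u, j):
        v = u ^ (1 << j)
        return v not in faulty_nodes and frozenset((u, v)) not in faulty_links

    visited = {source}
    stack = [source]   # current DFS path; the top holds the message
    trace = [source]   # every hop taken, in order
    while stack:
        u = stack[-1]
        if u == dest:
            return trace
        d = u ^ dest   # relative address: bits still to be corrected
        # Dimensions that bring the message closer come first, then detours.
        order = sorted(range(n), key=lambda j: (not (d >> j) & 1, j))
        for j in order:
            v = u ^ (1 << j)
            if usable(u, j) and v not in visited:
                visited.add(v)
                stack.append(v)
                trace.append(v)
                break
        else:
            stack.pop()                 # dead end: backtrack one hop
            if stack:
                trace.append(stack[-1])
    return None

# The example of Figure 4.16: H_3 with faulty node 011, from 000 to 111.
hops = route(3, 0b000, 0b111, faulty_nodes=[0b011])
print([format(h, "03b") for h in hops])
```

Because every reachable node is visited at most once before the search gives up, the sketch terminates, and it finds a path whenever one exists, although the path found is not necessarily the shortest.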

4.3.2 Origin-Based Routing in the Mesh 

The depth-first strategy described above has the advantage of not requiring any
advance information about which nodes are faulty: it uses backtracking if it
arrives at a dead-end. In this section, we describe a different approach, in which
we assume that the faulty regions are known in advance. With this information
available, no backtracking is necessary.

The topology we consider is a two-dimensional rectangular $N \times N$ mesh with
at most $N - 1$ failures. The procedure can be extended to meshes of dimension
three or higher, and to meshes with more than $N - 1$ failures. It is assumed that
all faulty regions are square. If they are not, additional nodes are declared to have
pseudo faults and are treated for routing purposes as if they were faulty, so that the
regions do become square. Figure 4.17 provides an example. Each node knows the
distance along each direction (east, west, north, and south) to the nearest faulty
region in that direction.

FIGURE 4.17 Faulty regions must be square.

The idea of origin-based routing is to define one node as the origin. By restricting
ourselves to the case in which there are no more than $N - 1$ failures in the mesh,
we can ensure that the origin is chosen so that its row and column do not have any
faulty nodes. Suppose we want to send a message from node S to node D. The
path from S to D is divided into an IN-path, consisting of edges that take the
message closer to the origin, and an OUT-path, which takes the message farther away
from the origin, ultimately reaching the destination. Here, distance is measured in
terms of the number of hops along the shortest path. In degenerate cases, either
the IN or the OUT path sets can be empty.

Key to the functioning of the algorithm is the notion of an outbox associated with 
the destination node, D. The outbox is the smallest rectangular region that contains 
within it both the origin and the destination. See Figure 4.18 for an example. 

Next, we need to define safe nodes. A node V is safe with respect to destination 
D and some set of faulty nodes, T, if both the following conditions are met: 

■ Node V is in the outbox for D. 

■ Given the faulty set T, if neither V nor D is faulty, there exists a fault-free
OUT-path from V to D.

FIGURE 4.18 Example of an outbox.


Finally, we introduce the notion of a diagonal band. Denote by $(x_A, y_A)$ the Cartesian
coordinates of node $A$. Then the diagonal band for a destination node $D$ is
the set of all nodes $V$ in the outbox for $D$ satisfying the condition that
$x_V - y_V = x_D - y_D + e$, where $e \in \{-1, 0, 1\}$.

For example, $(x_D, y_D) = (3, 2)$ in Figure 4.18 and $x_D - y_D = 3 - 2 = 1$. Thus, any
node $V$ within the outbox of $D$ such that $x_V - y_V \in \{0, 1, 2\}$ is in its diagonal band.
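The outbox and the diagonal band are simple to compute from coordinates. The sketch below is illustrative only. It assumes, purely for the sake of the example, that the origin sits at (0, 0); the actual position of the origin in Figure 4.18 is not given in the text. The function names are hypothetical.

```python
def outbox(origin, dest):
    """All nodes of the smallest rectangle containing both origin and dest."""
    (x0, y0), (xd, yd) = origin, dest
    xs = range(min(x0, xd), max(x0, xd) + 1)
    ys = range(min(y0, yd), max(y0, yd) + 1)
    return {(x, y) for x in xs for y in ys}

def diagonal_band(origin, dest):
    """Outbox nodes V with x_V - y_V = x_D - y_D + e for some e in {-1, 0, 1}."""
    xd, yd = dest
    return {(x, y) for (x, y) in outbox(origin, dest)
            if (x - y) - (xd - yd) in (-1, 0, 1)}

# Destination (3, 2) as in the text; origin (0, 0) is an assumption.
band = diagonal_band((0, 0), (3, 2))
print(sorted(band))
```

With these inputs, the band consists of the outbox nodes whose coordinate difference is 0, 1, or 2, matching the condition stated in the text.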

It is relatively easy to show by induction that the nodes of a diagonal band 
for destination D are safe nodes with respect to D. That is, once we get to a safe 
node, there exists an OUT-path from that node to D. Each step along an OUT-path 
increases the distance of the message to the origin: the message cannot therefore 
be traveling forever in circles. 

The routing algorithm consists of three phases. 

Phase 1. The message is routed on an IN-path until it reaches the outbox. At the
end of phase 1, suppose the message is in node U.

Phase 2. Compute the distance from U to the nearest safe node in each direction,
and compare this to the distance to the nearest faulty region in that direction. If
the safe node is closer than the fault, route to the safe node. Otherwise, continue
to route on the IN links.

Phase 3. Once the message is at a safe node U, if that node has a safe, non-faulty
neighbor V that is closer to the destination, send it to V. Otherwise, U must be
on the edge of a faulty region. In such a case, move the message along the edge
of the faulty region toward the destination D, and turn toward the diagonal
band when it arrives at the corner of the faulty square.

As an example, return to Figure 4.18 and consider routing a message from node 
S at the northwest end of the network to D. The message first moves along the IN 
links, getting ever closer to the origin. It enters the outbox at node A. Since there is 
a failure directly east of A, it continues on the IN links until it reaches the origin. 
Then it continues, skirting the edge of the faulty region until it reaches node B. At 
this point, it recognizes the existence of a safe node immediately to the north and 
sends the message through this node to the destination. 

For the case in which there are more than N — 1 failures in the mesh, we refer 
the reader to the Further Reading section for pointers to the literature. 

4.4 Further Reading 

Graph-theoretic connectivity is described in textbooks on graph theory. See, for 
example, [9,15]. An MS thesis [36] provides more up-to-date information on the 
use of connectivity in the study of network reliability. The notion of persistence 
was introduced in [8]. 

Several variations on the connectivity measure have been proposed. Conditional
connectivity has been defined in [16] as follows: the node (link) conditional
connectivity with respect to any network property P is the smallest number of nodes
(links) which must be removed from the network so that it is disconnected and
every component that is left has the property P. An example for the property P is:
"the component has at most k nodes." A variation on this connectivity measure
was presented in [19].

Another measure, called network resilience, was introduced in [27]. Network
resilience is defined with respect to some given probability threshold, $p$. Let $P(i)$
denote the probability that the network is disconnected exactly after the $i$th node
failure (but not before that) and assume that nodes fail according to some given
probability law. Then, the network resilience is the maximum $v$ such that

$$\sum_{i=1}^{v} P(i) \le p$$

A third measure, called toughness, was introduced in [11]. Toughness focuses on the
number of components a network can be broken down into after a certain number
of node failures. A network is said to have toughness $t$ if the failure of any set of $k$
of its nodes results in at most $\max\{1, k/t\}$ components. The greater the toughness,
the fewer the components into which the graph splinters. Some related
graph-theoretical work has been reported in [6]. A recent review of various measures of
robustness and resilience of networks appears in [20].

The extra-stage network was described in [1]. The dependability analysis of the
multistage and the extra-stage networks appears in [21-23]. Other fault-tolerant
multistage networks are described in [2]. The bandwidth of multistage and
crossbar networks was investigated in [29]. The dependability of meshes was
investigated extensively with a study appearing in [26]. Interstitial redundancy for
meshes was introduced in [33]. Several measures for hypercube reliability have
been proposed and calculated. For a good summary, see [34]. The Cube-Connected
Cycles network was introduced in [31] and a routing algorithm for it was
developed in [25] where an expression for the diameter is also presented. Several
proposals for modifying this network to increase its reliability exist, e.g., [4,35]. Loop
topologies have been studied extensively. The analysis which we present in this
chapter is based on [32]. A more recent paper citing many past publications is
[30]. Path (or terminal) reliability is studied in [17]. A good source for network
topologies in general is [13].

Fault-tolerant routing for hypercubes is presented in [7,10]. Such routing relies
on a depth-first strategy: see any standard book on algorithms, e.g., [3,12]. The
origin-based scheme for routing in meshes was introduced in [24]. The treatment
there is more general, including the case in which there are $N$ or more failures in
the mesh.

4.5 Exercises 

1. The node (link) connectivity of a graph is the minimum number of
node-disjoint (link-disjoint) paths between any pair of nodes. Show that the node
connectivity of a graph can be no greater than its link connectivity, and that
neither the node nor the link connectivity can exceed the minimum
node-degree of the graph (the degree of a node is the number of edges incident on
it). In particular, show that for a graph with $l$ links and $n$ nodes, the minimum
node-degree can never exceed $\lfloor 2l/n \rfloor$.

2. In this problem, we will study the resilience of a number of networks using
simulation. (If you are unfamiliar with simulation, it may be helpful to skim
through Chapter 10 on simulation techniques.) Assume that nodes fail with
probability $q_c$ and individual links with probability $q_l$. All failures are
independent of one another, and a node failure takes with it all the links that are
incident on it. Vary $q_c$ and $q_l$ between 0.01 and 0.25, and find the probability
that the network is disconnected. Do this for each of the following networks:

a. $n \times n$ rectangular mesh, for $n = 10, 20, 30, 40$.

b. $n \times n$ interstitial mesh with (1,4) interstitial redundancy, for $n = 10, 20, 30, 40$.

c. $n$-dimensional hypercube, for $n = 3, 4, 6, 8, 10, 12$.

3. For the networks listed above, find the diameter stability vector, DS. 

4. Consider a $2^k$-input butterfly network, in which the input and output feed
the same nodes (see the left subfigure of Figure 4.4). Write a simulation
program to find the probability that the network is disconnected (even if we allow
multiple passes through it by routing through intermediate nodes to get to the
ultimate destination), for $k = 4$ and 5, varying the probability of a switchbox
failure from $q_s = 0.01$ to $q_s = 0.25$. Assume that if a switchbox fails, it acts as
an open circuit.

5. Consider an $8 \times 8$ butterfly network. Suppose that each processor generates a
new request every cycle. This request is independent of whether or not its
previous request was satisfied, and is directed to memory module 0 with
probability 1/2 and to memory module $i$ with probability 1/14, for $i \in \{1, 2, \ldots, 7\}$.
Obtain the bandwidth of this network.

6. We showed how to obtain the probability, for a multistage network, that a
given processor is unable to connect to any memory. In our analysis, only link
failures were considered. Extend the analysis to include switchbox failures,
that occur with probability $q_s$. Assume that link and switchbox failures are all
mutually independent of one another.

7. In a $4 \times 4$ multistage butterfly network, $p_l$ is the probability that a link is
fault-free. Write expressions for the bandwidth $BW$, connectability $Q$, and the
expected number of accessible processors. Assume that a processor generates
memory requests with probability $p_r$. Assume that switchboxes do not fail.

8. Prove that the extra-stage butterfly network can tolerate the failure of up to
one switchbox and still retain connectivity from any input to any output.
(Assume that if the failed switchbox is either in the extra or the output stages, its
bypass multiplexer is still functional.)

9. Compare the reliability of an $N \times M$ interstitial mesh (with $M$ and $N$ both even
numbers) to that of a regular $N \times M$ mesh, given that each node has a
reliability $R(t)$ and links are fault-free. For what values of $R(t)$ will the interstitial
mesh have a higher reliability?

10. Derive an approximate expression for the reliability of a square (4,4)
interstitial redundancy array with 16 primary nodes and 9 spares. Denote the
reliability of a node by $R$ and assume that the links are fault-free.

11. A $3 \times 3$ crossbar has been augmented by adding a row and a column, and
input demultiplexers and output multiplexers. Assume that a switchbox can
fail with probability $q_s$ and when it fails all the incident links are disconnected.
Also assume that all links are fault-free but multiplexers and demultiplexers
can fail with probability $q_m$. Write expressions for the reliability of the original
$3 \times 3$ crossbar and for the fault-tolerant crossbar. (For the purposes of this
question, the reliability of the fault-tolerant crossbar is the probability that
there is a functioning $3 \times 3$ crossbar embedded within the $4 \times 4$ system.) Will
the fault-tolerant crossbar always have a higher reliability than the original
$3 \times 3$ crossbar?

12. Show that the three cases enumerated in connection with the derivation of the
hypercube network reliability (Section 4.2.4) are mutually exclusive. Further,
show that $H_n$ is connected under each of these cases. Assume that $q_c = 0$, i.e.,
that the nodes do not fail.

13. Obtain by simulation the network reliability of $H_n$ for $n = 5, 6, 7$. Assume that
$q_c = 0$. Compare this result in each instance with the lower bound that we
derived.

14. The links in an $H_3$ hypercube are directed from the node with the lower index
to the node with the higher index. Calculate the path reliability for the source
node 0 and the destination node 7. Denote by $p_{i,j}$ the probability that the link
from node $i$ to node $j$ is operational and assume that all nodes are fault-free.

15. All the links in a given $3 \times 3$ torus network are directed as shown in the
diagram below. Calculate the terminal reliability for the source node 1 and the
destination node 0. Denote by $p_{i,j}$ the probability that the link from node $i$ to
node $j$ is operational and assume that all nodes are fault-free.



16. Generate random graphs in the following way. Start with $n$ nodes; the
probability that there is a (bidirectional) link connecting nodes $i$ and $j$ is $p_e$. Vary
$p_e$ between 0.2 and 0.8 in steps of 0.1, and answer the following for each value
of $p_e$.

a. What fraction of these networks are connected?

b. Within the subset of connected networks, if links can fail with probability
$q_l$ and nodes never fail, what is the diameter stability vector, DS, of the
graph? Vary $q_l$ between 0.01 and 0.25.

17. In this question, you will use simulation to study the performance of the
hypercube routing algorithm studied in Section 4.3.1. (If you are unfamiliar
with simulation, it may be helpful to skim through Chapter 10 on simulation
techniques.) Assume that links fail with probability $q_l$ and that nodes never
fail. Generate message source and destination pairs at random; for each such
source and destination between which a path exists, determine the ratio of
the shortest distance between them and the distance of the path that is
actually discovered by the routing algorithm. Plot this number for hypercubes of
dimension 4, 6, 8, 10, 12 as a function of $q_l$, where $q_l$ varies from 0.01 to 0.25.

Software Fault Tolerance


Much has been written about why software is so defect prone and about why
the problem of designing and writing software is so intrinsically difficult.
Researchers recognize both the essential and accidental difficulties of producing
correct software. Essential difficulties arise from the inherent challenge of
understanding a complex application and operating environment, and from having to
construct a structure comprising an extremely large number of states, with very
complex state-transition rules. Further, software is subject to frequent
modifications, as new features are added to adapt to changing application needs. In
addition, as hardware and operating system platforms change with time, the software
has to adjust appropriately. Finally, software is often used to paper over
incompatibilities between interacting system components.

Accidental difficulties in producing good software arise from the fact that people
make mistakes in even relatively simple tasks. Translating the detailed design
into correctly working code may not require such advanced skills as creating a
correct design in the first place but is also mistake prone.

A great deal of work has gone into techniques to reduce the defect rate of
modern software. These techniques rely on extensive procedures to test software
programs for correctness and completeness. Testing, however, can never
conclusively verify the correctness of an arbitrary program. This can only be approached
through a formal mathematical proof. Constructing such formal proofs is currently
the subject of much active research; however, the state of the art at the present time
is rather primitive, and formal program proving is applicable only to small pieces
of software. As a result, it is a reasonable assumption that any large piece of
software that is currently in use contains defects.

Consequently, after doing everything possible to reduce the error rate of
individual programs, we have to turn to fault-tolerance techniques to mitigate the
impact of software defects (bugs). These techniques are the subject of this chapter.

Acceptance Tests 

As with hardware systems, an important step in any attempt to tolerate faults is to 
detect them. A common way to detect software defects is through acceptance tests. 
These are used in wrappers and in recovery blocks, both of which are important 
software fault-tolerance mechanisms and will be discussed later. 

If your thermometer were to read -40°C on a sweltering midsummer day, you
would suspect it was malfunctioning. This is an example of an acceptance test. An
acceptance test is essentially a check of reasonableness. Most acceptance tests fall
into one of the following categories.

Timing Checks. One of the simplest checks is timing. If we have a rough idea of 
how long the code should run, a watchdog timer can be set appropriately. When 
the timer goes off, the system can assume that a failure has occurred (either a 
hardware failure or something in the software that caused the node to "hang"). 
The timing check can be used in parallel with other acceptance tests. 
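A timing check of this kind can be sketched in a few lines. The following is only an illustration under stated assumptions: the function name run_with_watchdog and its return convention are hypothetical, and a thread join with a timeout stands in for a real watchdog timer. Note that this sketch merely reports the overrun; it does not stop the hung task.

```python
import threading
import time

def run_with_watchdog(task, timeout_s):
    """Run task() in a worker thread; report a timing failure if it overruns."""
    outcome = {}

    def worker():
        try:
            outcome["value"] = task()
        except Exception as exc:   # a crash is reported, not mistaken for success
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)         # plays the role of the watchdog timer
    if thread.is_alive():
        return "timeout", None     # timer expired before the task finished
    if "error" in outcome:
        return "exception", outcome["error"]
    return "ok", outcome["value"]

# A task that finishes in time, and one that "hangs" past its time budget.
fast = run_with_watchdog(lambda: sum(range(1000)), timeout_s=1.0)
hung = run_with_watchdog(lambda: time.sleep(0.5), timeout_s=0.05)
```

The value returned by a task that finishes in time can then be passed on to the other acceptance tests.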

Verification of Output. In some cases, the acceptance test is suggested naturally 
from the problem itself. That is, the nature of the problem is such that although 
the problem itself is difficult to solve, it is much easier to check that the answer 
is correct and it is also less likely that the check itself will be incorrect. To take a 
human analogy, solving a jigsaw puzzle can take a long time; checking to see that 
the puzzle has been correctly put together is trivial and takes just a glance. 

Examples of such problems are calculating the square root (square the result 
to check if you get the original number back), the factorization of large numbers 
(multiply the factors together), the solution of equations (substitute the alleged 
solution into the original equations), and sorting. Note that in sorting, it is not 
enough merely to check that the numbers are sorted: we have also to verify that 
all the numbers at the input are included in the output. 
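The two conditions for sorting can be written down directly. The sketch below is illustrative only; the function name is hypothetical, and the inputs are assumed to be hashable and mutually comparable values.

```python
from collections import Counter

def sort_passes_acceptance_test(inputs, outputs):
    """Acceptance test for a sorting routine.

    Checks (1) that the output is in nondecreasing order and
    (2) that the output contains exactly the input items, with multiplicity.
    """
    in_order = all(a <= b for a, b in zip(outputs, outputs[1:]))
    same_items = Counter(inputs) == Counter(outputs)
    return in_order and same_items
```

An output such as [1, 2, 2] for the input [3, 1, 2] is sorted but still fails the test, because the item 3 has been lost.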

Sometimes, to save time, we will restrict ourselves to probabilistic checks. These 
do not guarantee that all erroneous outputs will be caught even if the checks are 
executed perfectly, but have the advantage of requiring less time. One example of
such a check for the correctness of matrix multiplication is as follows.

Suppose we multiply two $n \times n$ integer matrices $A$ and $B$ to produce $C$. To check
the result without repeating the matrix multiplication, we may select at random an
$n \times 1$ vector of integers, $R$, and carry out the operations $M_1 = A \times (B \times R)$ and
$M_2 = C \times R$. If $M_1 \neq M_2$, then we know that an error has occurred. If $M_1 = M_2$, that still
does not prove that the original result $C$ was correct; however, it is very improbable
that the random vector $R$ was selected such that $M_1 = M_2$ even if $A \times B \neq C$. To
further reduce this probability, we may select another $n \times 1$ vector and repeat the
check. The complexity of this test is $O(mn^2)$, where $m$ is the number of vectors
selected.
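A minimal sketch of this check (often called Freivalds' check) is given below. The function names, the default of three trials, and the choice of drawing the entries of R uniformly from 1 to 1000 are assumptions made for illustration, not part of the method as stated above.

```python
import random

def mat_vec(M, v):
    """Multiply matrix M (a list of rows) by column vector v: O(n^2) work."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def probabilistic_product_check(A, B, C, m=3, rng=None):
    """Return False if C is certainly not A x B, True if no error was found.

    Each of the m trials costs three matrix-vector products, so the
    total work is O(m n^2) instead of the O(n^3) of a recomputation.
    """
    rng = rng or random.Random()
    n = len(A)
    for _ in range(m):
        R = [rng.randint(1, 1000) for _ in range(n)]   # random n x 1 vector
        M1 = mat_vec(A, mat_vec(B, R))                 # A x (B x R)
        M2 = mat_vec(C, R)                             # C x R
        if M1 != M2:
            return False                               # error detected
    return True                                        # probably correct

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_good = [[19, 22], [43, 50]]
C_bad = [[19, 22], [43, 51]]    # one corrupted entry
```

A True result is only probabilistic evidence that C is correct, whereas a False result is conclusive.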

Range Checks. In other cases, we do not have such convenient and obvious 
approaches to checking the correctness of the output. In such situations, range 
checks can be used. That is, we use our knowledge of the application to set
acceptable bounds for the output: if it falls outside these bounds, it is declared to
be erroneous. Such bounds may be either preset or some simple function of the 
inputs. If the latter, the function has to be simple enough to implement so that the 
probability of the acceptance test software itself being faulty is sufficiently low. 

For example, consider a remote-sensing satellite that takes thermal imagery of 
the earth. We could obviously set bounds on the temperature range and regard 
any output outside these bounds as indicating an error. Furthermore, we could 
use spatial correlations, which means looking for excessive differences between 
the temperatures in adjacent areas and flagging an error if the differences cannot 
be explained by physical features (such as volcanoes). 

When setting the bounds on acceptance tests, we have to balance two
parameters: sensitivity and specificity. We have encountered these quantities before in
Chapter 2: recall that sensitivity is the probability that the acceptance test catches
an erroneous output. To be more exact, it is the conditional probability that the
test declares an error, given the output is erroneous. Specificity, in contrast, is the
conditional probability that, given that the acceptance test declares an error, it is
indeed an error and not a correct output that happens to fall outside the test bounds.
A closely related parameter is the probability of false alarm, which is the conditional
probability that the test declares as erroneous an output that is actually correct.

An increase in sensitivity can be achieved by narrowing the bounds.
Unfortunately, this would at the same time decrease the specificity and increase the
probability of false alarms. In an absurdly extreme case, we could narrow the acceptance
range to zero, so that every output flags an error! In such a case the sensitivity
would be 100%, but the probability of a false alarm would be high because every
output, correct or not, is sure to be declared erroneous. The specificity in such a
case would be low, equal to the underlying error rate. Clearly, such an acceptance
test is useless.
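The trade-off can be made concrete with a small numerical sketch. It uses the definitions given above (in particular, specificity as the probability that a declared error is a real one). The function name and the labeled sample data are invented for illustration, and an empty range (low > high) is used to model an acceptance range narrowed to zero.

```python
def acceptance_test_metrics(samples, low, high):
    """samples is a list of (value, is_erroneous) pairs.

    The range check declares an error when a value lies outside [low, high].
    Returns (sensitivity, specificity, false_alarm_probability); the sample
    set is assumed to contain erroneous, correct, and flagged outputs.
    """
    flagged = [(v < low or v > high) for v, _ in samples]
    erroneous = [e for _, e in samples]
    caught = sum(f and e for f, e in zip(flagged, erroneous))
    false_alarms = sum(f and not e for f, e in zip(flagged, erroneous))
    n_err = sum(erroneous)
    n_ok = len(samples) - n_err
    sensitivity = caught / n_err           # P(declared error | erroneous)
    specificity = caught / sum(flagged)    # P(erroneous | declared error)
    false_alarm = false_alarms / n_ok      # P(declared error | correct)
    return sensitivity, specificity, false_alarm

# Hypothetical temperature outputs; three of the eight are erroneous.
samples = [(20.0, False), (22.0, False), (25.0, False), (21.0, False),
           (95.0, True), (-60.0, True), (24.5, True), (19.0, False)]

wide = acceptance_test_metrics(samples, -50.0, 60.0)
narrow = acceptance_test_metrics(samples, 20.0, 23.0)
empty = acceptance_test_metrics(samples, 1.0, 0.0)   # every output is flagged
```

With the empty range, the sensitivity is 1 and the false alarm probability is also 1, while the specificity equals 3/8, the fraction of erroneous outputs in the sample.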

5.2 Single-Version Fault Tolerance

In this section, we consider ways by which individual pieces of software can
be made more robust. We start by looking at wrappers, which are
robustness-enhancing interfaces for software modules. Then, we discuss software
rejuvenation, and finally, we describe the use of data diversity.

5.2.1 Wrappers 

As its name implies, a wrapper is a piece of software that encapsulates the given
program when it is being executed (see Figure 5.1). We can wrap almost any level
of software: examples include application software, middleware, and even an
operating system kernel. Inputs from the outside world to the wrapped entity are
intercepted by the wrapper, which decides whether to pass them on or to signal
an exception to the system. Similarly, outputs from the wrapped software are also
filtered by the wrapper.

FIGURE 5.1 A wrapper.

Wrappers became popular when people started using Commercial Off-the-Shelf
(COTS) software components for high-reliability applications. COTS
components are written for general-purpose applications, for which errors are an
annoyance but not a calamity. Before such components can be used in applications
requiring high reliability, they need to be embedded in some environment that
reduces their error rate. This environment (the wrapper) has to head off inputs to the
software that are either outside the specified range or are known to cause errors;
similarly, the wrapper passes the output through a suitable acceptance test before
releasing it. If the output fails the acceptance test, this fact must be conveyed to
the system, which then decides on an appropriate course of action.

Wrappers are specific to the wrapped entity and the system. Here are some 
examples of their use. 

(1) Dealing with Buffer Overflow. The C programming language does not
perform range checking for arrays, which can cause either accidental or maliciously
intended damage. Writing a large string into a small buffer causes buffer overflow;
since no range checking is performed, a region of memory outside the buffer is
overwritten. For example, consider the strcpy() function in C, which copies strings
from one place to another. If one executes the call strcpy(str1, str2), where str1 is a
buffer of size 5 and str2 is a string of length 25, the resulting buffer overflow would
overwrite a region of memory outside the str1 buffer. Such overflows have been
exploited by hackers to cause harm.

A wrapper can check to ensure that such overflows do not happen, for example,
by checking that the buffer is large enough for the designated string to be copied.
If this check fails, the wrapper does not call strcpy(); instead, it returns an error
or raises an exception.
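The logic of such a wrapper can be sketched as follows. Python is used here purely for illustration: a fixed-size bytearray stands in for the C destination buffer, and the names CopyRejected and checked_copy are hypothetical. Room for the terminating zero byte of a C string is included in the size check.

```python
class CopyRejected(Exception):
    """Raised by the wrapper instead of letting the copy overflow."""

def checked_copy(dest, src):
    """Copy src into the fixed-size buffer dest only if it fits.

    dest is a bytearray whose length models the C buffer size;
    src is a bytes object without the terminating zero byte.
    """
    needed = len(src) + 1                 # string plus terminating zero
    if needed > len(dest):
        raise CopyRejected(
            f"need {needed} bytes, buffer holds only {len(dest)}")
    dest[:len(src)] = src                 # the wrapped copy operation
    dest[len(src)] = 0                    # terminate like a C string

str1 = bytearray(5)                       # buffer of size 5
checked_copy(str1, b"abcd")               # fits: 4 characters plus terminator
```

Calling checked_copy(str1, b"x" * 25) raises CopyRejected and leaves the buffer untouched, which corresponds to the wrapper refusing to invoke the wrapped function.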

(2) Checking the Correctness of the Scheduler. Consider a wrapper around the 
task scheduler in a fault-tolerant, real-time system. Unlike general-purpose
operating systems, such schedulers do not generally use round-robin scheduling. One
real-time scheduling algorithm is Earliest Deadline First (EDF), in which, as the term 
implies, the system executes the task with the earliest deadline among all the tasks 
that are ready to run. This is subject to constraints on preemptibility, because some 
tasks may not be preemptible in certain parts of their executions. 

Such a scheduler can be wrapped by having the wrapper verify that the 
scheduling algorithm is being correctly executed, so that the scheduler always
selects the ready task with the earliest deadline and that any arriving task with an
earlier deadline preempts the executing task (assuming the latter is preemptible). 
To do its job, the wrapper obviously needs information about which tasks are 
ready to run and their deadlines and about whether the currently executing task 
is preemptible. To obtain this information, it may be necessary to get the vendor 
of the scheduler software to provide a suitable interface. 
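The check performed by such a wrapper might look like the sketch below. It assumes that the wrapper has somehow obtained the set of ready tasks with their deadlines and the preemptibility of the running task; the function name and the data layout are hypothetical and are not taken from any real scheduler interface.

```python
def edf_decision_ok(chosen, ready_deadlines, running=None,
                    running_preemptible=True):
    """Return True if selecting `chosen` is a legal EDF decision.

    ready_deadlines maps every ready task (including the currently
    running one, if any) to its absolute deadline.
    """
    if chosen not in ready_deadlines:
        return False                      # scheduler picked a task that is not ready
    if running is not None and not running_preemptible:
        return chosen == running          # a nonpreemptible task must keep running
    # Otherwise the chosen task must have the earliest deadline.
    return ready_deadlines[chosen] == min(ready_deadlines.values())

deadlines = {"T1": 30, "T2": 10, "T3": 20}
```

Here edf_decision_ok("T2", deadlines) holds, whereas a scheduler that selected T3 would be flagged unless T3 were already running in a nonpreemptible part of its execution.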

(3) Using Software with Known Bugs. Suppose we are using a software
module with known bugs. That is, we have found, either through intensive testing or
through field reports, that the software fails for a certain set of inputs, S. Suppose,
further, that the software vendor has not (yet) put out a version that corrects these
bugs. Then, we can implement a wrapper that intercepts the inputs to that
software and checks to see if those inputs are in the set S. If not, it forwards them to
the software module for execution; if yes, it returns a suitable exception to the
system. Alternatively, the wrapper can redirect the input to some alternative,
custom-written code that handles inputs in S.

(4) Using a Wrapper to Check for Correct Output. Such a wrapper includes an 
acceptance test through which every output is filtered. If the output passes the 
test, it is forwarded outside. If not, an exception is raised, and the system has to 
deal with a suspicious output. 
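Examples (3) and (4) can be combined in one generic wrapper, sketched below. All names (make_wrapper, WrapperException, and the callback parameters) are hypothetical, and the wrapped module is assumed to be a function of a single input.

```python
class WrapperException(Exception):
    """Signals to the system that an input or an output was rejected."""

def make_wrapper(module, in_failure_set, acceptance_test, fallback=None):
    """Wrap `module` with an input filter and an output acceptance test.

    in_failure_set(x)     -> True if x is in the known failure set S
    acceptance_test(x, y) -> True if output y is acceptable for input x
    fallback(x)           -> optional custom code that handles inputs in S
    """
    def wrapped(x):
        if in_failure_set(x):
            if fallback is None:
                raise WrapperException(f"input {x!r} is known to cause failure")
            y = fallback(x)               # redirect to the alternative code
        else:
            y = module(x)                 # forward to the wrapped module
        if not acceptance_test(x, y):
            raise WrapperException(f"output {y!r} failed the acceptance test")
        return y
    return wrapped
```

In this sketch the output of the fallback code is also passed through the acceptance test; whether that is desirable depends on how much the custom-written code is trusted.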

Our ability to successfully wrap a piece of software depends on several factors: 

1. Quality of the Acceptance Tests. This is application dependent and has a direct 
impact on the ability of the wrapper to stop erroneous outputs. 

2. Availability of Necessary Information from the Wrapped Component. Often, the
wrapped component is a "black box" and all we can observe about its behavior
is the output produced in response to a given input; in such cases, the
wrapper will be somewhat limited. For example, our scheduler wrapper would be
impossible to implement without information about the deadlines of the tasks
waiting to run. Ideally, we would like complete access to the source code; where
this is impossible for commercial or other reasons, we would like to have the
vendors themselves provide well-defined interfaces by which the wrapper can
obtain relevant information from the wrapped software.

3. Extent to Which the Wrapped Software Module Has Been Tested. Extensively testing
software allows us to identify regions of the input space for which the software 
fails, and reduces the probability of contaminating the system with incorrect 
output. 

5.2.2 Software Rejuvenation 

When your personal computer hangs, the obvious reaction is to reboot it. This is 
an example of software rejuvenation.

As a process executes, it may keep acquiring memory and file locks without
properly releasing them. Also, its data tend to get corrupted as uncorrected
errors accumulate. The process may also consume (without releasing) threads and
semaphores. If this goes on indefinitely, the process can become faulty and stop
executing. To head this off, we can proactively halt the process, clean up its internal
state, and then restart it. This is called software rejuvenation.


Rejuvenation Level 

One can rejuvenate at either the application or the processor level. Rejuvenation
at the application level consists of suspending an individual application, cleaning
up its state by garbage collection, reinitialization of data structures, etc., and then
restarting it. Rejuvenation at the processor level consists of rebooting the
processor and affects all applications running on that processor. If we have a processor
cluster, it is beneficial to stagger such rejuvenations so that no more than a small
fraction of the processors are under rejuvenation at any one time. Selecting the
appropriate level consists of determining at what level the resources have degraded
or become exhausted.

Timing of Rejuvenation 

Software rejuvenation can be based on either time or prediction. 

Time-based rejuvenation consists of rejuvenating at constant intervals. To
determine the optimal inter-rejuvenation period, we must balance the benefits against
the cost. Let us construct a simple mathematical model to do this. We use the
following notation:

$N(t)$   Expected number of errors over an interval of length $t$ (without rejuvenation)
$C_e$    Cost of each error
$C_r$    Cost of each rejuvenation
$P$      Inter-rejuvenation period

By adding up the costs due to rejuvenation and to errors, we obtain the overall
expected cost of rejuvenation over a period $P$, denoted by $C_{\mathrm{rejuv}}(P)$:

$$C_{\mathrm{rejuv}}(P) = N(P)\,C_e + C_r$$

The cost per unit time, $C_{\mathrm{rate}}(P)$, is then given by

$$C_{\mathrm{rate}}(P) = \frac{C_{\mathrm{rejuv}}(P)}{P} = \frac{N(P)\,C_e + C_r}{P} \qquad (5.1)$$


To get some insight into this expression, let us study three cases for $N(P)$. First,
consider what happens if the software has a constant error rate $\lambda$ throughout
its execution, which implies that $N(P) = \lambda P$. Substituting this into Equation 5.1, we
have $C_{\mathrm{rate}}(P) = \lambda C_e + C_r/P$. It is easy to see that to minimize $C_{\mathrm{rate}}(P)$, we must set
$P = \infty$. This implies that if the error rate is constant, software rejuvenation should
not be applied at all. Rejuvenation is useful only to head off a potential increased
error rate as the software executes.

Next, consider $N(P) = \lambda P^2$. From Equation 5.1, we obtain $C_{\mathrm{rate}}(P) = \lambda P C_e + C_r/P$.
To minimize this quantity, we find $P$ such that $dC_{\mathrm{rate}}(P)/dP = 0$ (and
$d^2C_{\mathrm{rate}}(P)/dP^2 > 0$). Differentiating, we find the optimal value of the rejuvenation
period, denoted by $P^*$, to be $P^* = \sqrt{C_r/(\lambda C_e)}$.

The third case is a generalization of the above: $N(P) = \lambda P^n$, $n > 1$. From
Equation 5.1, we have $C_{\mathrm{rate}}(P) = \lambda P^{n-1} C_e + C_r/P$. Using elementary calculus, as before,
we find the optimal value of the rejuvenation period to be

$$P^* = \left[ \frac{C_r}{(n-1)\,\lambda\, C_e} \right]^{1/n}$$
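The closed-form result can be checked numerically. The following sketch assumes the error model $N(P) = \lambda P^n$ used above; the function names and the sample parameter values are illustrative only.

```python
import math

def cost_rate(P, lam, n, C_e, C_r):
    """Cost per unit time from Equation 5.1 with N(P) = lam * P**n."""
    return (lam * P**n * C_e + C_r) / P

def optimal_period(lam, n, C_e, C_r):
    """Optimal inter-rejuvenation period for N(P) = lam * P**n."""
    if n <= 1:
        return math.inf        # no benefit: never rejuvenate
    return (C_r / ((n - 1) * lam * C_e)) ** (1.0 / n)

# Example: lam = 1, n = 2, C_e = 1, C_r = 4 gives P* = sqrt(4) = 2.
P_star = optimal_period(lam=1.0, n=2, C_e=1.0, C_r=4.0)
```

Evaluating cost_rate slightly below and slightly above P_star gives larger values than at P_star itself, as expected at a minimum.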


Figure 5.2 shows how the optimal rejuvenation period varies as a function of $C_r/C_e$
and $n$ for the model described above.

To set the period $P$ appropriately, we need to know the values of the
parameters $C_r/C_e$ and $N(t)$. These can be obtained experimentally by running simulations
on the software, or alternatively, the system could be made adaptive, with some 
default initial values being chosen to begin with. Over time, as we gather statistics 
reflecting the failure characteristics of the software, the rejuvenation period can be 
adjusted appropriately. 

Prediction-based rejuvenation involves monitoring the system characteristics
(amount of memory allocated, number of file locks held, and so on), and
predicting when the system will fail. For example, if a process is consuming memory at a
certain rate, the system can estimate when it will run out of memory. Rejuvenation
then takes place just before the predicted crash.

FIGURE 5.2 Optimal rejuvenation period. $\lambda = 1$; time units are arbitrary; curve labels indicate
$C_r/C_e$.


The software that implements prediction-based rejuvenation must have access
to enough state information to make such predictions. If it comes as part of the
operating system, such information is easy to collect. If it is a package that runs atop
the operating system with no special privileges, it will be constrained to using
whatever interfaces are provided by the operating system to collect status
information. For example, the Linux system provides the following utilities:

■ vmstat provides information about processor utilization, memory and
paging activity, traps, and I/O.

■ iostat outputs the percentage CPU utilization at the user and system levels,
as well as a report on the usage of each I/O device.

■ netstat indicates network connections, routing tables, and a table of all the 
network interfaces. 

■ nfsstat provides information about network file server kernel statistics. 

Once the appropriate status information has been collected, trends can be
identified and a prediction made as to when these trends will cause errors to occur. For
example, if we are tracking the allocation of memory to a process, we might do a 
least-squares fit of a polynomial to the memory allocations over some window of 
the recent past. 

The simplest such fit is a straight line, or a polynomial of degree one,
$f(t) = mt + c$. More complex ones may involve a higher-degree polynomial, say of
degree $n$. Suppose the selected window of the recent past consists of $k$ time
instances $t_1 < t_2 < \cdots < t_k$, where $t_k$ is the most recent one. Given the measurements
$\mu(t_1), \mu(t_2), \ldots, \mu(t_k)$, where $\mu(t_i)$ is the allocated memory at time $t_i$, we seek to find
the coefficients of the polynomial

$$f(t) = m_n t^n + m_{n-1} t^{n-1} + \cdots + m_1 t + m_0$$

so as to minimize the quantity

$$\sum_{i=1}^{k} \left[ \mu(t_i) - f(t_i) \right]^2$$

This polynomial can then be used to extrapolate into the future and predict when
the process will run out of memory.

In the standard least-squares fit, each observed point $\mu(t_i)$ has the same weight
in determining the fit. A variation of this procedure is the weighted least-squares
fit, in which we seek to minimize the weighted sum of the squares. In our memory
allocation example, we would choose weights $w_1, w_2, \ldots, w_k$ and then determine
the coefficients of $f(t)$ such that the quantity

$$\sum_{i=1}^{k} w_i \left[ \mu(t_i) - f(t_i) \right]^2$$

is minimized. Having weights allows us to give greater emphasis to certain points.
For example, if we use $w_1 < w_2 < \cdots < w_k$, recent data will influence the fit more
than older data.
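For the degree-one case, the weighted fit has a simple closed form, sketched below in plain Python. The function names, the sample measurements, and the memory limit are hypothetical; the sketch also assumes that at least two distinct time instances are available.

```python
def weighted_line_fit(ts, mus, ws=None):
    """Fit f(t) = m*t + c minimizing sum of w_i * (mu_i - f(t_i))**2."""
    if ws is None:
        ws = [1.0] * len(ts)               # ordinary least squares
    W = sum(ws)
    t_bar = sum(w * t for w, t in zip(ws, ts)) / W
    mu_bar = sum(w * mu for w, mu in zip(ws, mus)) / W
    s_tt = sum(w * (t - t_bar) ** 2 for w, t in zip(ws, ts))
    s_tm = sum(w * (t - t_bar) * (mu - mu_bar)
               for w, t, mu in zip(ws, ts, mus))
    m = s_tm / s_tt
    c = mu_bar - m * t_bar
    return m, c

def predicted_exhaustion_time(m, c, limit):
    """Time at which the fitted line reaches `limit`; None if not growing."""
    if m <= 0:
        return None
    return (limit - c) / m

# Hypothetical window: allocated memory (MB) sampled at four time instances.
ts = [0.0, 1.0, 2.0, 3.0]
mus = [100.0, 110.0, 120.0, 130.0]
m, c = weighted_line_fit(ts, mus)                      # slope 10, intercept 100
t_fail = predicted_exhaustion_time(m, c, limit=200.0)  # line reaches 200 MB at t = 10
```

Passing increasing weights such as ws=[1, 2, 3, 4] gives recent measurements more influence on the fitted slope.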

The above curve-fitting approaches are all vulnerable to the impact of a few
outlying points (points that are unusually high or low), which can have a distorting
effect on the fit. Techniques are available to make the fit more robust by reducing 
the impact of such points; see the Further Reading section for a pointer to more 
information. 

Combined Approach. The two approaches described above can be combined by
rejuvenating at either the end of the scheduled period P or at the time when the
next error is predicted to happen, whichever comes first.

5.2.3 Data Diversity 

The input space of a program is the space spanned by all possible inputs. This 
space can be divided into failure and nonfailure regions. The program fails if and 
only if an input from the failure region is applied. 

Failure regions come in every shape and size. Input spaces typically have a
large number of dimensions, but we can visualize them only in the unrealistically
simple case of a two-dimensional input space. Figure 5.3 shows two arbitrarily
drawn failure regions. In both cases, the failure region occupies the same fraction
of the input area, but in Figure 5.3a it consists of a number of relatively small
islands, whereas in Figure 5.3b it consists of a single large, contiguous area. In both
cases, the software will fail for the same fraction of all possible inputs. The crucial
difference is that in Figure 5.3a, a small perturbation of the inputs is sufficient to
move them out of a failure region to a nonfailure region.

FIGURE 5.3 Failure regions. (a) Small, scattered failure regions. (b) Large, contiguous failure region.

Failure regions such as in Figure 5.3a suggest a possible fault-tolerance
approach: consider perturbing the input slightly and hope that if the original input
falls in a failure region, the perturbed input will fall in a nonfailure region. This
general approach is called data diversity. How it is actually implemented depends
on the error-detection mechanism. If only one copy of the software is executed at
any one time and an acceptance test is used to detect errors, then we can
recompute with perturbed inputs and recheck the resulting output. If massive
redundancy is used, we may apply slightly different input sets to different versions of
the program and vote on their output (see Section 5.3).

Perturbation of the input data can be done either explicitly or implicitly. Explicit
perturbation consists of adding a small deviation term to a selected subset of the
inputs. Implicit perturbation involves gathering inputs to the program in such a
way that we can expect them to be slightly different. For example, suppose we
have software controlling an industrial process whose inputs are the pressure and
temperature of a piece of refrigeration equipment. Every second, these parameters $(p_i, t_i)$
are measured and then input to the controller. Now, from physical considerations,
we can expect that the pressure measured in sample $i$ is not much different from
that in sample $i - 1$. Implicit perturbation in this context may consist of using
$(p_{i-1}, t_i)$ as an input alternative to $(p_i, t_i)$. With luck, if $(p_i, t_i)$ is in a failure region,
$(p_{i-1}, t_i)$ will not be, thus providing some resilience. Whether or not this is
acceptable obviously depends on the dynamics of the application and the sampling rate.
If, as is often the case, we sample at a higher rate than is absolutely necessary, this
approach is likely to be useful.

Another approach is to reorder the inputs. A somewhat contrived example is
the program that adds a set of three input floating-point numbers, $a, b, c$. If the
inputs are in the order $a, b, c$, then it first computes $a + b$ and then adds $c$ to this
partial sum. Consider the case $a = 2.2\mathrm{E}{+}20$, $b = 5$, $c = -2.2\mathrm{E}{+}20$. Depending
on the precision used (e.g., if the significand [mantissa] field of the floating-point
number has room for fewer than 20 decimal digits, which is about 66 bits), it is
possible that $a + b$ as calculated will be $2.2\mathrm{E}{+}20$, so that the final result will be
$a + b + c = 0$, which is incorrect. Now, change the order of the inputs; let it be $a, c, b$.
Then, $a + c = 0$, so that $a + c + b = 5$.
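This effect is easy to reproduce. The short sketch below assumes Python floats, which are IEEE double-precision numbers with a 53-bit significand, well short of the roughly 66 bits mentioned above; the variable names are illustrative.

```python
a, b, c = 2.2e20, 5.0, -2.2e20

# Original input order a, b, c: b is absorbed when added to the huge a.
sum_abc = (a + b) + c      # evaluates to 0.0

# Reordered inputs a, c, b: the large terms cancel exactly first.
sum_acb = (a + c) + b      # evaluates to 5.0
```

An acceptance test that rejects the first result could therefore trigger a recomputation with the reordered inputs.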

There is one important difference between the two examples we have seen.
Although in both cases we are reexpressing the inputs, the refrigeration controller
was an example of inexact reexpression, whereas the example of calculating $a + b + c$
is an instance of exact reexpression. In the first example, the software is attempting
to compute some function, $f(p, t)$, of the pressure and temperature, yet for inputs
$(p, t)$ falling in a failure region, the actual output of the software will not equal
$f(p, t)$; we are also likely to have $f(p_i, t_i) \neq f(p_{i-1}, t_i)$. In the second example, that of
calculating $a + b + c$, we should in theory have $a + b + c = a + c + b$, and it is only
the limitations of the implementation (in this case the limited precision provided
by floating-point arithmetic) that cause an error on this sequence $a, b, c$ of inputs.

When exact reexpression is used, the associated output can be used as is (as long 
as it passes the acceptance test or the vote on the multiple versions of the program). 
If we have inexact reexpression, the output will not be exactly what was meant to 
be computed. Depending on the application and the amount of perturbation, we 
may or may not attempt to correct for the perturbation before using the output. 
If the application is somewhat robust, we may use the raw output as a somewhat 
degraded, but still acceptable, alternative to the desired output; if it is not, we must 
correct for the perturbation. 

One way to correct the output for the perturbation is to use the Taylor
expansion. Recall that for one variable (assuming that the function is differentiable to
any degree) the Taylor expansion of $f(t)$ around the point $t_0$ is

$$f(t) = f(t_0) + \sum_{n=1}^{\infty} \frac{f^{(n)}(t_0)}{n!}\,(t - t_0)^n$$

where $f^{(n)}(t_0)$ is the value at $t = t_0$ of the $n$th derivative of $f(t)$ with respect to $t$.

In other cases, we may not have the desired function in analytic form and must 
use other approaches to correct the output. 


5.2.4 Software Implemented Hardware Fault Tolerance (SIHFT)

Data diversity can be combined with time redundancy to construct techniques for
Software Implemented Hardware Fault Tolerance (SIHFT) with the goal of
detecting hardware faults. A SIHFT technique can provide an inexpensive alternative
to hardware and/or information redundancy techniques and can be especially
attractive when using COTS microprocessors, which typically do not support error
detection.

FIGURE 5.4 An n-bit bus with a permanent stuck-at-0 fault.

Suppose the program has all integer variables and constants. It can be
transformed to a new program in which all variables and constants are multiplied by
a constant k (called the diversity factor), and whose final results are expected to
be k times the results of the original program. When both the original and the
transformed programs are executed on the same hardware (i.e., using time
redundancy), the results of these two programs will be affected by hardware faults in
different ways, depending on the value of k. By checking whether the results of
the transformed program are k times the results produced by the original program,
hardware faults can be detected.

How do we select a suitable value of $k$? The selected value should result in a
high probability of detecting a fault, yet it should be small enough so as not to cause
an overflow or underflow, which may prevent us from correctly comparing the
outputs of the two programs. Furthermore, if the original program includes logic
operations such as bit-wise XOR or AND, we should restrict ourselves to values
that are of the form $k = 2^l$, with an integer $l$, since in this case multiplication by $k$
becomes a simple shift operation.


■ EXAMPLE

Consider an $n$-bit bus shown in Figure 5.4 and suppose that bit $i$ of the bus
has a permanent stuck-at-0 fault. If the data sent over the bus has its $i$th bit
equal to 1, the stuck-at fault will result in erroneous data being received at
the destination. If a transformed program with $k = 2$ is executed on the same
hardware, the $i$th bit of the data will now use line $(i + 1)$ of the bus and will
not be affected by the fault. The executions of the two programs will yield
different results, indicating the presence of a fault.

Obviously, the stuck-at-0 fault will not be detected if both bits $i$ and $(i - 1)$
of the data that is forwarded on the bus are 0. Assuming that all $2^n$ possible
values on the $n$-bit bus are equally likely, this event will occur with probability
0.25. If, however, the transformed program uses $k = -1$ (meaning that every
variable and constant in the program undergoes a two's complement
operation), almost all 0s in the original program will turn into 1s in the transformed
program, greatly reducing the probability of an undetected fault. ■

(a) The original program

    i = 0;
    x = 3;
    y = 1;
    while (i < 5) {
        y = y * (x + i);
        i = i + 2;
    }
    z = y;

(b) The transformed program

    i = 0;
    x = 6;
    y = 2;
    while (i < 10) {
        y = y * (x + i) / 2;
        i = i + 4;
    }
    z = y;

FIGURE 5.5 An example of a program transformation for k = 2.
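The bus example can be simulated in a few lines. The sketch below models an 8-bit bus whose line 2 is stuck at 0; the function names, the bus width, and the choice of faulty line are assumptions made for illustration. Data values are restricted to 7 bits so that multiplying by k = 2 does not overflow the bus.

```python
def send_over_bus(value, stuck_bit, n_bits=8):
    """Return what the destination receives when line stuck_bit is stuck at 0."""
    mask = (1 << n_bits) - 1
    return (value & mask) & ~(1 << stuck_bit)

def fault_detected(value, k, stuck_bit, n_bits=8):
    """Compare the transformed transfer against k times the original one."""
    mask = (1 << n_bits) - 1
    original = send_over_bus(value, stuck_bit, n_bits)
    transformed = send_over_bus(k * value, stuck_bit, n_bits)
    return transformed != (k * original) & mask

# Fraction of 7-bit data values for which the fault escapes detection with k = 2.
values = range(128)
undetected = sum(not fault_detected(v, 2, stuck_bit=2) for v in values)
fraction_undetected = undetected / len(values)    # 0.25
```

The undetected cases are exactly the values in which both bit 2 and bit 1 are 0, in agreement with the probability of 0.25 derived in the example.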


The risk of overflow while executing the transformed program exists even for
small values of $k$. In particular, even $k = -1$ can generate an overflow if the original
variable assumed the value of the most negative integer that can be
represented using the two's complement scheme (for a 32-bit integer this is $-2^{31}$).
Thus, the transformed program should take appropriate precautions, for example,
by scaling up the type of integer used for that variable. Range analysis can be
performed to determine which variables must be scaled up to avoid overflows.

The actual transformation of the program, given the value of k, is quite
straightforward and can be easily automated. The example in Figure 5.5 shows the
transformation for k = 2. Note that the result of the multiplication in the transformed
program must be divided by k to ensure proper scaling of the variable y.
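The two programs of Figure 5.5 can be transcribed and compared directly. The Python rendering below is a sketch: the function names are hypothetical, and the transformed program is written for a general positive integer k (with k = 2 it matches the figure).

```python
def original_program():
    i, x, y = 0, 3, 1
    while i < 5:
        y = y * (x + i)
        i = i + 2
    return y                        # z = y

def transformed_program(k=2):
    # Every variable and constant is multiplied by the diversity factor k.
    i, x, y = 0 * k, 3 * k, 1 * k
    while i < 5 * k:
        y = y * (x + i) // k        # divide by k to keep y scaled by k, not k*k
        i = i + 2 * k
    return y                        # z = y

z = original_program()              # 105
z_k = transformed_program(k=2)      # 210
fault_suspected = (z_k != 2 * z)    # False on fault-free hardware
```

On fault-free hardware the two results agree up to the factor k, so a mismatch indicates a fault that affected the two executions differently.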

If floating-point variables are used in the program, some of the simple choices
for $k$ considered above are no longer adequate. For example, for $k = -1$, only the
sign bit of the transformed variable will change (assuming the IEEE standard
representation of floating-point numbers is followed; see the Further Reading
section). Even selecting $k = 2^l$ for an integer $l$ is inappropriate, since multiplying by
such a $k$ will only affect the exponent field. The significand field will remain intact,
and any error in it will not be detected. Both the significand field and the exponent
field must, therefore, be multiplied, possibly by two different values of $k$.

To select value(s) of k for a given program such that the SIHFT technique will
provide a high coverage (detect a large fraction of the hardware faults) we can
carry out experimental studies by injecting faults into a simulation of the
hardware (see Chapter 10 for a discussion of fault injection) and determine the
fault-detecting capability for each candidate value of k.

FIGURE 5.6 Example of the use of recomputing with shifted operands.

Recomputing with Shifted Operands (RESO) 

The Recomputing with Shifted Operands (RESO) approach is similar to SIHFT,
with the main difference being that the hardware is modified to support fault
detection. In this approach, each unit that executes either an arithmetic or a logic
operation is modified so that it first executes the operation on the original operands
and then re-executes the same operation on transformed operands. The same
issues that had to be resolved for the SIHFT technique exist for the RESO technique
as well. Here, too, the transformations of the operands are limited to simple shifts,
which correspond to $k$ being of the form $k = 2^l$ with $l$ an integer. Avoiding an
overflow when executing the transformed computation is easier for RESO than for
SIHFT, since the datapath of the modified hardware unit can be extended to include
some extra bits, and thus avoid overflow. Figure 5.6 shows an ALU (Arithmetic
and Logic Unit, capable of executing addition, subtraction, and bit-wise logic
operations) that has been modified to support the RESO technique. In the first step, the
two original operands X and Y are, for example, added without being shifted, and
the result Z stored in the register. In the next step, the two operands are shifted by
$l$ bit positions and then added. The result of this second addition is then shifted by
the same number of bit positions, but in the opposite direction, and then compared
with the contents of the register, using the checker circuit.
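The two-step operation can be mimicked in software to see how a fault in the adder is exposed. The sketch below is a simplified model, not a description of the circuit in Figure 5.6: the function name, the 8-bit operand width, and the fault model (one output line of the adder stuck at 0) are assumptions.

```python
def reso_add(x, y, l=1, n_bits=8, stuck_at_0_bit=None):
    """Add x and y twice, RESO style; return (result, error_flag)."""
    width = n_bits + l + 1              # extended datapath: carry plus shift
    mask = (1 << width) - 1

    def adder(a, b):
        s = (a + b) & mask
        if stuck_at_0_bit is not None:
            s &= ~(1 << stuck_at_0_bit)  # injected fault on one output line
        return s

    z = adder(x, y)                      # step 1: unshifted operands
    z_check = adder(x << l, y << l) >> l # step 2: shifted operands, shift back
    return z, z != z_check               # checker compares the two results

ok = reso_add(3, 2)                          # (5, False)
faulty = reso_add(3, 2, stuck_at_0_bit=2)    # (1, True): fault detected
```

A fault is flagged only when it corrupts the two computations differently; if neither result uses the faulty line, both are correct and no error is signaled.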

5.3 N-Version Programming 

In this approach to software fault tolerance, N independent teams of programmers 
develop software to the same specifications. These N versions of software are then 
run in parallel, and their output is voted on. The hope is that if the programs are 
developed independently, it is very unlikely that they will fail on the same inputs. 
Indeed, if the bugs are assumed to be statistically independent and each has the 
same probability, q, of occurring, then the probability of software failure of an 
N-version program can be computed in a way similar to that of an NMR cluster 
(see Chapter 2). That is, the probability of no more than m defective versions out 
of N versions, under the defect/bug independence assumption, is 

\[ P_{md}(N, m, q) = \sum_{i=0}^{m} \binom{N}{i} q^{i} (1 - q)^{N-i} \]
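
As a minimal sketch, the probability of at most m defective versions can be evaluated directly from the binomial sum. The function names are hypothetical, and the helper `system_failure_probability` additionally assumes a majority voter that tolerates up to (N - 1) // 2 defective versions, which is an assumption of this example rather than something fixed by the text.

```python
from math import comb


def p_md(n, m, q):
    """Probability of no more than m defective versions out of n (independent bugs)."""
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(m + 1))


def system_failure_probability(n, q):
    """Failure probability under the assumed majority-voting rule."""
    tolerated = (n - 1) // 2
    return 1.0 - p_md(n, tolerated, q)


for n in (3, 5, 7):
    print(n, system_failure_probability(n, 0.01))
```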
N-version programming is far from trivial to implement. We start our
discussion by showing how difficult it can be to even arrive at a consensus
among correctly functioning versions.

5.3.1 Consistent Comparison Problem 

Consider N independently written software versions, $V_1, \ldots, V_N$, for some
application. Suppose the overall structure of each version involves computing
some quantity, x, and comparing it with a constant, c. Let $x_i$ denote the
value of x as computed by version $V_i$. The comparison with c is said to be
consistent if either $x_i \geq c$ for all $i = 1, \ldots, N$, or $x_i < c$ for
all $i = 1, \ldots, N$.

Consider an application such that 


    if (f(p, t) < c)
        take action A1
    else
        take action A2
    end if

The job of each version is to output the action to be taken. In such a case, we clearly 
want all functional versions to be consistent in their comparisons. 

Since the versions are written independently and may actually use different
algorithms to compute the function f(p, t), we expect that their respective
calculations may yield values for f(p, t) that differ slightly. To take a
concrete example, let c = 1.0000 and N = 3. Suppose the versions $V_1$, $V_2$,
and $V_3$ output values 0.9999, 0.9998, and 1.0001, respectively. Then,
$x_1 < c$, $x_2 < c$ but $x_3 > c$: the comparisons are not consistent. As a
result, $V_1$ and $V_2$ will order action A1 to be taken and $V_3$ will order
action A2, even though all three versions are functioning correctly.
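
The numerical example above can be reproduced in a few lines. The dictionary of outputs and the function name `choose_action` are illustrative assumptions; the threshold and values are those quoted in the text.

```python
C = 1.0000  # the constant c used in the comparison


def choose_action(x, c=C):
    """Each version orders A1 if its computed value is below c, otherwise A2."""
    return "A1" if x < c else "A2"


# Values of f(p, t) as computed by three correctly functioning versions.
outputs = {"V1": 0.9999, "V2": 0.9998, "V3": 1.0001}

actions = {name: choose_action(x) for name, x in outputs.items()}
consistent = len(set(actions.values())) == 1

print(actions)
print("consistent comparison:", consistent)
```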

Such inconsistent comparisons can occur even if the precision is so high that
the version outputs deviate by very little from one another: there is no way to
guarantee a general solution to the consistent comparison problem. We can
establish this by showing that any algorithm which guarantees that any two
n-bit integers which differ by less than $2^k$ will be mapped to the same m-bit
output (where m + k < n), must be the trivial algorithm that maps every input
to the same number. Suppose we have such an algorithm: we start the proof with
k = 1. The integers 0 and 1 differ by less than $2^k$, so the algorithm will map
both of them to the same number, say a. Similarly, 1 and 2 differ by less than
$2^k$, so they will also be mapped to a. Proceeding in this way, we can easily
show that 3, 4, ... will all be mapped by this algorithm to a, which means that
this must be the trivial algorithm that maps all integers to the same number, a.

The above discussion assumes that it is integers that are being compared;
however, it is easy to prove that a similar result holds for real numbers of
finite precision that differ even slightly from one another.

This problem may arise whenever the versions compare a variable with a given
threshold. Given that the software may involve a large number of such
comparisons, the potential exists for each version to produce distinct, unequal
results, even if no errors have occurred, so long as even minor differences
exist in the values being calculated. Such differences cannot usually be
removed, because each version may use a different algorithm and in any case is
programmed independently.

Why is this a problem? After all, if nonfailing versions can differ in their output, 
it is reasonable to suppose that the output of any of them would be acceptable 
to the application. Although this is true, the system has no means to determine 
whether the outputs are in disagreement because they are erroneous or because 
of the consistent comparison problem. Note that it is possible for the nonfailing 
versions to disagree due to this problem while multiple failed versions produce 
identical wrong outputs (due to a common bug). The system would then most 
likely select the wrong output. 

One can, in principle, bypass the consistent comparison problem completely, by 
having the versions decide on a consensus value of the variable before carrying out 
the comparison. That is, before checking if some variable x > c, the versions run 
an algorithm to agree on which value of x to use. However, this would add the 
requirement that, where there are multiple comparisons, the order of comparisons 
be specified. Restricting the implementation of the versions in this way can reduce 
version diversity, thus increasing the potential for correlated errors. Also, if the 
number of such comparisons is large, a significant degradation of performance 
could occur because a large number of synchronization points would be created. 
Versions that arrive at the comparison points early would have to wait for the 
slower ones to catch up. 

Another approach that has been suggested is to use confidence signals. While
carrying out the "x > c?" comparison, each version should consider the
difference $|x - c|$. If $|x - c| < \delta$ for some prespecified $\delta$, the
version announces that it has low confidence in its output (because there is the
potential for it to disagree with the other versions). The function that votes
on the version outputs could then ignore the low-confidence versions or give
them a lower weight. Unfortunately, if one functional version has
$|x - c| < \delta$, chances are quite high that this will also be true of other
functional versions, whose outputs will also be devalued by the voter. In
addition, it raises the possibility of an incorrect result that is far from c,
outvoting multiple correct results, which are (correctly) close to c.
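
The weakness just described can be illustrated with a small sketch of a voter that ignores low-confidence votes. The function names, the labels "above" and "not_above", and the numeric values are assumptions chosen for the illustration, not part of any published scheme.

```python
from collections import Counter


def version_vote(x, c, delta):
    """Return (decision, confident) for the comparison 'x > c?'."""
    decision = "above" if x > c else "not_above"
    confident = abs(x - c) >= delta
    return decision, confident


def confidence_voter(votes):
    """Majority among confident votes only; None if every vote is low-confidence."""
    confident_decisions = [decision for decision, ok in votes if ok]
    if not confident_decisions:
        return None
    return Counter(confident_decisions).most_common(1)[0][0]


c, delta = 1.0, 0.001
votes = [
    version_vote(1.0002, c, delta),  # correct version, close to c: low confidence
    version_vote(1.0004, c, delta),  # correct version, close to c: low confidence
    version_vote(0.5, c, delta),     # faulty version, far from c: high confidence
]
print(confidence_voter(votes))
```

In this run the two correct versions are discarded and the single faulty version determines the outcome, which is the failure mode the paragraph warns about.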

The frequency with which the consistent comparison problem arises and the
length of time for which it lasts depend on the nature of the application. In
applications where historical state information is not used (e.g., if the
calculation depends only on the latest input values and is not a function of
past values), the consistent comparison problem may occur infrequently and go
away fairly quickly.

5.3.2 Version Independence 

Correlated errors between versions can increase the overall error probability by
orders of magnitude. For example, consider the case N = 3, which can tolerate up
to one failed version for any input. Suppose that the probability that a version
produces an incorrect output is $q = 10^{-4}$. That is, on the average, each of
these versions produces an incorrect output once every 10,000 runs. If the
versions are stochastically independent, then the error probability of the
three-version system is

\[ q^3 + 3q^2(1 - q) \approx 3 \times 10^{-8} \]

Now, suppose stochastic independence does not hold and that there is one defect
mode which is common to two of the three versions and is exercised on the
average once every million runs (that is, about one in every 100 bugs of a
version is due to a common mistake). Every time this bug is exercised, the
system will fail. The error probability of the three-version system now
increases to over $10^{-6}$, which is more than 30 times the error probability
of the uncorrelated system.
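
The comparison can be checked numerically. The sketch below uses only the figures quoted in the text; treating the common-mode rate as a lower bound on the correlated system's failure probability is a simplifying assumption of this example.

```python
q = 1e-4  # per-version failure probability

# Independent versions: two or more of the three fail on the same input.
independent = q**3 + 3 * q**2 * (1 - q)

# One defect mode shared by two versions, exercised once per million runs.
# Every such run defeats the voter, so this rate is a lower bound.
common_mode = 1e-6

print("independent system:", independent)
print("correlated system (lower bound):", common_mode)
print("ratio:", common_mode / independent)
```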

Let us explore the issue of correlation a little further. Quite often, the input
space (the space of all possible input patterns) can be subdivided into regions
according to the probability that an input from that region will cause a version
to fail. Thus, for example, if there is some numerical instability in a given
subset of the input space, the error rate for that subspace may be greater than
the average error rate over the entire space of inputs. Suppose that versions
are stochastically independent in each subspace, that is,

\[ \mathrm{Prob}\{V_1, V_2 \text{ both fail} \mid \text{input is from subspace } S_i\}
   = \mathrm{Prob}\{V_1 \text{ fails} \mid \text{input is from } S_i\} \cdot
     \mathrm{Prob}\{V_2 \text{ fails} \mid \text{input is from } S_i\} \]

According to the total probability formula, the unconditional probability of
failure of an individual version is

\[ \mathrm{Prob}\{V_j \text{ fails}\}
   = \sum_i \mathrm{Prob}\{V_j \text{ fails} \mid \text{input is from } S_i\} \cdot
     \mathrm{Prob}\{\text{Input is from } S_i\} \qquad (j = 1, 2) \]

The unconditional probability that both $V_1$ and $V_2$ will fail is

\[ \mathrm{Prob}\{V_1, V_2 \text{ both fail}\}
   = \sum_i \mathrm{Prob}\{V_1 \text{ fails} \mid S_i\} \cdot
     \mathrm{Prob}\{V_2 \text{ fails} \mid S_i\} \cdot
     \mathrm{Prob}\{\text{Input is from } S_i\} \]

Let us consider two numerical examples. For ease of exposition, we will assume
the input space consists of only two subspaces $S_1$ and $S_2$, and that the
probability of the input being from $S_1$ or $S_2$ is 0.5.
■ EXAMPLE

The conditional failure probabilities are as follows:

    Version    S_1      S_2
    V_1        0.010    0.001
    V_2        0.020    0.003

The unconditional failure probabilities for the two versions are

\[ \mathrm{Prob}\{V_1 \text{ fails}\} = 0.01 \times 0.5 + 0.001 \times 0.5 = 0.0055 \]

\[ \mathrm{Prob}\{V_2 \text{ fails}\} = 0.02 \times 0.5 + 0.003 \times 0.5 = 0.0115 \]

If the two versions were stochastically independent, the probability of both
failing for the same input would be

\[ \mathrm{Prob}\{V_1 \text{ fails}\} \cdot \mathrm{Prob}\{V_2 \text{ fails}\}
   = 0.0055 \times 0.0115 = 6.33 \times 10^{-5} \]

The actual joint failure probability, however, is somewhat greater:

\[ \mathrm{Prob}\{V_1, V_2 \text{ both fail}\}
   = 0.01 \times 0.02 \times 0.5 + 0.001 \times 0.003 \times 0.5 = 1.02 \times 10^{-4} \]

The reason is that the two versions' failure propensities are positively
correlated: they are both much more prone to failure in $S_1$ than in $S_2$. ■


■ EXAMPLE

The failure probabilities are as follows:

    Version    S_1      S_2
    V_1        0.010    0.001
    V_2        0.003    0.020

The unconditional failure probabilities of the individual versions are identical
to those in the previous example. However, the joint failure probability is now

\[ \mathrm{Prob}\{V_1, V_2 \text{ both fail}\}
   = 0.01 \times 0.003 \times 0.5 + 0.001 \times 0.02 \times 0.5 = 2.5 \times 10^{-5} \]

This is about a four-fold decrease from the corresponding number in the
previous example, and less than half of what it would have been if the versions
had been stochastically independent.

The reason is that now the propensities to failure of the two versions are
negatively correlated: $V_1$ is better in $S_2$ than in $S_1$, whereas the
opposite is true for $V_2$. Intuitively, $V_1$ and $V_2$ make up for each
other's deficiencies. ■
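
The arithmetic of the two examples can be reproduced with a short sketch that applies the total probability formula per subspace. The helper names `unconditional` and `joint` are hypothetical; the probabilities are those given in the two tables.

```python
def unconditional(p_fail_by_region, p_region):
    """Total probability of failure of one version over all subspaces."""
    return sum(p * w for p, w in zip(p_fail_by_region, p_region))


def joint(p1_by_region, p2_by_region, p_region):
    """Joint failure probability, assuming independence within each subspace."""
    return sum(a * b * w for a, b, w in zip(p1_by_region, p2_by_region, p_region))


p_region = [0.5, 0.5]      # Prob{input is from S_1}, Prob{input is from S_2}
v1 = [0.010, 0.001]
v2_positive = [0.020, 0.003]  # first example: same weak region as V_1
v2_negative = [0.003, 0.020]  # second example: opposite weak region

if_independent = unconditional(v1, p_region) * unconditional(v2_positive, p_region)
print("if fully independent:", if_independent)
print("positively correlated:", joint(v1, v2_positive, p_region))
print("negatively correlated:", joint(v1, v2_negative, p_region))
```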
Ideally, we would therefore like the multiple versions to be negatively
correlated; realistically, we expect most correlations to be positive because
the versions are ultimately all addressing the same problem. In any event, the
focus in N-version programming has historically been on making the versions as
stochastically independent as possible, rather than on making them negatively
correlated.

The stochastic independence of versions can be compromised by a number of 
factors. 

■ Common Specifications. If programmers work off the same specification, 
errors in these specifications will propagate to the software. 

■ Intrinsic Difficulty of the Problem. The algorithms being programmed 
may be far more difficult to implement in one subset of the input space than 
in others. Such a correlation in difficulty can translate into multiple versions 
having defects that are triggered by the same input sets. 

■ Common Algorithms. Even if the implementation of the algorithm is correct,
the algorithm itself may contain instabilities in certain regions of the
input space. If the different versions are implementing the same algorithm,
then these instabilities will be replicated across the versions.

■ Cultural Factors. Programmers who are trained to think in similar ways 
can make similar (or the same) mistakes quite independently. Furthermore, 
such correlation can result in ambiguous specifications being interpreted in 
the same erroneous way. 

■ Common Software and Hardware Platforms. The operating environment
comprises the processors on which the software versions are executed and
the operating system. If we use the same hardware and operating system,
faults/defects within these can trigger a correlated failure. Strictly speaking,
this would not constitute a correlated application software failure; however,
from the user's point of view, this would still be a failure. Common
compilers can also cause correlated failures.

Independence among the versions can be gained by either incidental diversity or
forced diversity. Incidental diversity is the by-product of forcing the
developers of different modules to work independently of one another. Teams
working on different modules are forbidden to directly communicate with one
another. Questions regarding ambiguities in the specifications or any other
issue have to be addressed to some central authority, which makes any necessary
corrections and updates all the teams. Inspection of the software must be
carefully coordinated so that the inspectors of one version do not directly or
indirectly leak information about another version.

Forced diversity is a more proactive approach and forces each development
team to follow some approach that is believed to increase the chances of diversity.
Here are some of the ways in which this can be forced.

Use Diverse Specifications. Several researchers have remarked that the
majority of software bugs can be traced to the requirements specification. Some
even claim that two-thirds of all bugs can be laid at the door of faulty
specifications! This is one important motivation for using diverse
specifications. That is, rather than working on a common specification,
diversity can begin at the specification stage. The specifications may be
expressed in different formalisms. The hope is that specification errors will
not coincide across versions, and each specification version will trigger a
different implementation error profile. It is beginning to be accepted that the
specifications impact how one thinks about a problem: the same problem, if
specified differently, may well pose a different level of difficulty to the
implementor.

We may also decide to make the various versions have differing capabilities. For 
example, in a three-version system, one of the versions may be more rudimentary 
than the other two, providing a less accurate—but still acceptable—output. The 
hope is that the implementation of a simpler algorithm will be less error-prone 
and more robust (experience less numerical instability). In most cases, the two 
other versions will run correctly. In the (hopefully rare) instances when they do 
not, the third version can save the system (or at least help determine which of the 
two disagreeing other versions is correct). If the third version is very simple, then 
formal methods may be considered to actually prove that it is correct. A similar 
approach of using a simpler version is often used in recovery blocks, which are 
discussed in Section 5.4. 

Use Diverse Programming Languages. Anyone experienced in programming 
knows that the programming language can significantly impact the quality of the 
software that is produced. For example, we would expect a program written in 
assembly language to be more bug-prone than is one in a higher-level language. 
The nature of the bugs can also be different. In our discussion of wrappers (in 
Section 5.2.1), we saw that it is possible to get programs written in C to overflow 
their allocated memory. Such bugs would be impossible in a language that strictly 
manages memory. Errors arising from an incorrect use of pointers, not uncommon 
in C programs, will not occur in Fortran, which has no pointers. 

Diverse programming languages may have diverse libraries and compilers, 
which the user hopes will have uncorrelated (or, even better, negatively correlated) 
bugs. 

Certain programming languages may be more attuned to a given problem than
others. For example, many would claim that Lisp is a more natural language in
which to code some artificial intelligence (AI) algorithms than are C or Fortran.
In other words, Lisp's expressive power is more congruent to some AI problems
than that of C or Fortran. In such a case, an interesting problem arises. Should
all versions use the language that is well attuned to the problem or should we
force some versions to be written in other languages that are less suited to the
application? If all the versions are written in the most suitable language, we
can expect that their individual error rate will be lower; on the other hand,
the different versions may experience correlated errors. If they are written in
diverse languages, the individual error rates of the versions written in the
"poorer" languages may be greater, but the overall error rate of the N-version
system may be lower if these bugs do not give rise to as many correlated errors.
A similar comment applies to the use of diversity in other dimensions, such as
development environments or tools. This trade-off is difficult to resolve
without extensive, and expensive, experimental work.

Use Diverse Development Tools and Compilers. This may make possible
"notational diversity" and thereby reduce the extent of positive correlation
between bugs. Since tools can themselves be faulty, using diverse tools for
different versions may allow for greater reliability. A similar remark applies
to compilers.

Use Cognitively Diverse Teams. By cognitive diversity, we mean diversity in the 
way that people reason and approach problems. If teams are constituted to ensure 
that different teams have different approaches to reasoning, this can potentially 
give rise to software that has fewer correlated bugs. At the moment, however, 
procedures to ensure such cognitive diversity are not available. 


5.3.3 Other Issues in N-Version Programming

Back-to-Back Testing. Having multiple versions that solve the same problem 
gives us the opportunity to test them back to back. The testing process consists of 
comparing their outputs for the same input, which helps identify noncoincident 
bugs. 

In addition to comparing the overall outputs, designers have the option of
comparing corresponding intermediate variables. Figure 5.7 shows an idealized
example. We have three versions: V1, V2, V3. In addition to their final
outputs, the designers have identified two points during their execution when
corresponding variables are generated. These can be compared to provide
additional back-to-back checks.

Using intermediate variables can provide increased observability into the
behavior of the programs, and may identify defects that are not easily
observable at the outputs. However, defining such variables constrains the
developers to producing these variables and may reduce program diversity.
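
Back-to-back testing of final outputs can be sketched as a simple comparison harness. Everything here is illustrative: the three "versions" are hypothetical implementations of the absolute-value function, one of them with a deliberately seeded bug, and the tolerance `tol` is an assumed parameter.

```python
def back_to_back(versions, test_inputs, tol=1e-9):
    """Return (input, outputs) pairs for inputs on which the versions disagree."""
    disagreements = []
    for x in test_inputs:
        outputs = [version(x) for version in versions]
        if max(outputs) - min(outputs) > tol:
            disagreements.append((x, outputs))
    return disagreements


def v1(x):
    return abs(x)


def v2(x):
    return (x * x) ** 0.5


def v3(x):
    if x >= 0:
        return x
    return -x if x < -1 else x  # seeded bug: wrong sign for -1 <= x < 0


for x, outputs in back_to_back([v1, v2, v3], [-2.0, -0.5, 0.0, 3.0]):
    print("disagreement at input", x, "outputs", outputs)
```

Note that a harness of this kind only exposes noncoincident bugs: if all versions returned the same wrong value for some input, no disagreement would be reported.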

Using Diverse Hardware and Operating Systems. The output of the system
depends on the interaction between the application software and its platform,
mainly comprising the operating system and the processor. Both processors and
operating systems are notorious for the bugs they contain. It is, therefore, a
good idea to complement software design diversity with hardware and operating
system diversity, by running each version on a different processor type and
operating system.

FIGURE 5.7 Example of intermediate variables in back-to-back testing.

Cost of N-Version Programming. Software is expensive to develop, and creating
N versions rather than one is more expensive still. Very little information is
publicly available about the cost of developing N versions: for a pointer to a
case study, see the Further Reading section. According to that study, the
overhead of developing an additional version varies from 25% to 134% of the
single-version cost. This is an extremely wide range!

A first-order estimate is that developing N versions is N times as expensive as
developing a single version. However, some parts of the development process may
be common. For instance, if all versions work off the same specifications, only
one set of specifications needs to be developed. On the other hand, the
management of an N-version project imposes overheads not found in traditional
software development. Still, costs can be kept under control by carefully
identifying the most critical portions of the code and only developing
N versions for these.

Producing a Single Good Version Versus Many Versions. Given a total time
budget, consider two choices: (a) develop a single version (over which we
lavish the entire allocated time), and (b) develop N versions. Unfortunately,
software reliability modeling is not yet sufficiently advanced for us to make
an effective estimate of which would be better and under what circumstances.

Experimental Results. A few experimental studies have been carried out into
the effectiveness of N-version programming. Published results are generally only
available for work carried out in universities, and it is not clear how the
results obtained by using student programmers would change if professional and
experienced programmers were used.

One typical study was conducted at the University of Virginia and the
University of California at Irvine. The study had a total of 27 students write
code for an anti-missile application. The students ranged from some with no
prior industrial experience to others with over 10 years. All versions were
written in Pascal and run on a Prime machine at the University of Virginia and
a DEC VAX 11/750 at the University of California at Irvine. A total of 93
correlated bugs were identified by standard statistical hypothesis-testing
methods: if the versions had been stochastically independent, we would have
expected no more than about five. Interestingly, no correlation was observed
between the quality of the programs produced and the experience of the
programmer. A similar conclusion (that versions were not stochastically
independent) was drawn from another experiment, conducted under NASA auspices,
by North Carolina State University, the University of California at Santa
Barbara, the University of Virginia, and the University of Illinois.


5.4 Recovery Block Approach 

Similarly to N-version programming, the recovery block approach also uses
multiple versions of software. The difference is that in the latter, only one
version runs at any one time. If this version should be declared as failing,
execution is switched to a backup.

5.4.1 Basic Principles 

Figure 5.8 illustrates a simple implementation of this method. There is a primary
version and three secondary versions in this example. Only the primary is initially
executed. When it completes execution, it passes along its output to an acceptance
test, which checks to see if the output is reasonable. If it is, then the output
is accepted by the system. If not, then the system state is rolled back to the
point at which the primary started computation, and secondary 1 is invoked. If
this succeeds (the output passes the acceptance test), the computation is over.
Otherwise, we roll the system back to the beginning of the computation, and then
invoke secondary 2. We keep going until either the outcome passes an acceptance
test or we run out of secondaries. In the latter case, the recovery block
procedure will have failed, and the system must take whatever corrective action
is needed in response (e.g., the system may be put in a "safe" state, such as a
reactor being shut down).
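
The control flow of a recovery block can be sketched as follows. This is a minimal illustration under stated assumptions: the system state is modeled as a dictionary, rollback is done by restoring a deep copy taken at entry, and the names `recovery_block`, `RecoveryBlockFailure`, `buggy_primary`, and `secondary` are all hypothetical.

```python
import copy


class RecoveryBlockFailure(Exception):
    """Raised when the primary and every secondary fail the acceptance test."""


def recovery_block(state, versions, acceptance_test):
    """Try versions in order (primary first); roll back state before each attempt."""
    checkpoint = copy.deepcopy(state)
    for version in versions:
        working = copy.deepcopy(checkpoint)  # roll back to the starting point
        try:
            output = version(working)
        except Exception:
            continue  # a crash is treated like a failed acceptance test
        if acceptance_test(output):
            return output, working
    raise RecoveryBlockFailure("no version passed the acceptance test")


def buggy_primary(state):
    state["scratch"] = "partial update"  # side effect that must be undone
    return -1.0


def secondary(state):
    return state["x"] ** 0.5


def is_square_root_of_4(r):
    return r >= 0 and abs(r * r - 4.0) < 1e-9


result, final_state = recovery_block({"x": 4.0}, [buggy_primary, secondary],
                                     is_square_root_of_4)
print(result, final_state)
```

When `RecoveryBlockFailure` is raised, the caller is responsible for the corrective action, such as moving the system to a safe state.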

The success of the recovery block approach depends on: (a) the extent to which 
the primary and various secondaries fail on the same inputs (correlated bugs), and 
(b) the quality of the acceptance test. These clearly vary from one application to the 
next. 

5.4.2 Success Probability Calculation 

Let us set up a simple mathematical model for the success probability of the
recovery block approach, under the assumption that the different versions fail
independently of one another. We can use this model to determine which
parameters most affect the software failure probability. We use the following
notation:

FIGURE 5.8 Recovery block structure with three secondaries.

    E          the event that the output of a version is erroneous
    T          the event that the test reports that the output is wrong
    f          the failure probability of a version
    s          the test sensitivity
    $\sigma$   the test specificity
    n          the number of available software versions (primary plus secondaries)

Thus,

\[ f = P\{E\}, \quad s = P\{T \mid E\}, \quad \sigma = P\{E \mid T\} \]

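For the scheme to succeed, it must succeed at some stage i, $1 \leq i \leq n$.
This will happen if the test fails stages $1, \ldots, i - 1$ (causing the scheme
to go to the next version), and at stage i the version's output is correct and
it passes the test. We now have

\[ \mathrm{Prob}\{\text{Success in stage } i\}
   = [P\{T\}]^{i-1} P\{\bar{E} \cap \bar{T}\} \]

\[ \mathrm{Prob}\{\text{Scheme is successful}\}
   = \sum_{i=1}^{n} [P\{T\}]^{i-1} P\{\bar{E} \cap \bar{T}\} \tag{5.2} \]

\[ P\{E \cap T\} = P\{T \mid E\} P\{E\} = sf \]

\[ P\{T\} = \frac{P\{E \cap T\}}{P\{E \mid T\}} = \frac{sf}{\sigma} \tag{5.3} \]

\[ P\{\bar{E} \mid T\} = 1 - P\{E \mid T\} = 1 - \sigma \]

\[ P\{\bar{E} \cap T\} = P\{\bar{E} \mid T\} P\{T\} = (1 - \sigma) \frac{sf}{\sigma} \]

\[ P\{\bar{E}\} = 1 - P\{E\} = 1 - f \]

\[ P\{\bar{E} \cap \bar{T}\} = P\{\bar{E}\} - P\{\bar{E} \cap T\}
   = (1 - f) - (1 - \sigma) \frac{sf}{\sigma} \tag{5.4} \]

Substituting Equations 5.3 and 5.4 into Equation 5.2 yields

\[ \mathrm{Prob}\{\text{Scheme is successful}\}
   = \sum_{i=1}^{n} \left[ \frac{sf}{\sigma} \right]^{i-1}
     \left[ (1 - f) - (1 - \sigma) \frac{sf}{\sigma} \right]
   = \frac{1 - (sf/\sigma)^{n}}{1 - sf/\sigma}
     \left[ (1 - f) - (1 - \sigma) \frac{sf}{\sigma} \right] \tag{5.5} \]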


Equation 5.5 can be examined to determine the effect of the various parameters
on the success probability of the scheme. One such analysis is shown in Figure 5.9
for a recovery block structure with one primary and two secondaries (n = 3) and
two values of the acceptance test sensitivity and specificity, namely, 0.95 and 0.85.
For these parameter values, the test sensitivity has a greater impact on the success
probability than its specificity.
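
A small sketch can be used to explore Equation 5.5 numerically. It takes f as the version failure probability, s as the test sensitivity P{T|E}, and sigma as the test specificity P{E|T}, with P{T} = s f / sigma as in Equation 5.3. The function names and the value f = 0.1 are assumptions chosen for illustration, and the closed form is valid only when s f / sigma is less than 1.

```python
def success_probability(f, s, sigma, n):
    """Closed form of Equation 5.5."""
    p_t = s * f / sigma                      # P{T}, Equation 5.3
    p_pass = (1 - f) - (1 - sigma) * p_t     # P{output correct and test passes}, Eq. 5.4
    return (1 - p_t**n) / (1 - p_t) * p_pass


def success_probability_sum(f, s, sigma, n):
    """Same quantity computed stage by stage, as in Equation 5.2."""
    p_t = s * f / sigma
    p_pass = (1 - f) - (1 - sigma) * p_t
    return sum(p_t ** (i - 1) * p_pass for i in range(1, n + 1))


f = 0.1  # hypothetical version failure probability
for s in (0.85, 0.95):
    for sigma in (0.85, 0.95):
        print(s, sigma, success_probability(f, s, sigma, n=3))
```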


5.4.3 Distributed Recovery Blocks 

The structure of the distributed recovery block is shown in Figure 5.10, where we
consider the special case with just one secondary version. The two nodes carry
identical copies of the primary and secondary. Node 1 executes the primary, while,
in parallel, node 2 executes the secondary. If node 1 fails the acceptance test, the
output of node 2 is used (provided that it passes the acceptance test). The output
of node 2 can also be used if there is a watchdog timer and node 1 fails to produce
an output within a prespecified time.

Once the primary copy fails, the roles of the primary and secondary copies are
reversed. Node 2 continues to execute its copy, which is now treated as the
primary. The execution by node 1 of what was previously the primary copy is used
as a backup. This continues until the execution by node 2 is flagged as erroneous,
in which case the system toggles back to using the execution by node 2 as a backup.

FIGURE 5.9 Success probability of the recovery block structure for n = 3 and two values
of the acceptance test sensitivity s and specificity $\sigma$.

FIGURE 5.10 Distributed recovery block structure.

Because the secondary is executed in parallel with the primary, we do not have 
to wait for the system to be rolled back and the secondary to be executed: the 
execution is overlapped with that of the primary. This saves time, and is useful 
when the application is a real-time system with tight task deadlines. 

Our example has included just two versions; the scheme can obviously be
extended to an arbitrary number of versions. If we have n versions (primary plus
n - 1 secondaries), we will run all n in parallel, one on each processor.
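
The overlap of primary and secondary execution can be imitated on a single machine with two worker threads standing in for the two nodes. This is only a rough sketch: the function name, the `timeout` parameter acting as the watchdog, and the use of threads instead of separate processors are all assumptions, and the role reversal after a primary failure is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


def distributed_recovery_block(primary, secondary, acceptance_test, x, timeout=1.0):
    """Run both versions in parallel; prefer the primary's output if it is acceptable."""
    pool = ThreadPoolExecutor(max_workers=2)
    try:
        futures = [pool.submit(primary, x), pool.submit(secondary, x)]
        for future in futures:  # primary first, then secondary
            try:
                result = future.result(timeout=timeout)  # watchdog on each node
            except Exception:
                continue  # timed out or crashed: fall through to the other node
            if acceptance_test(x, result):
                return result
        raise RuntimeError("both nodes failed the acceptance test")
    finally:
        # Threads cannot be killed, so a hung version keeps running in the background.
        pool.shutdown(wait=False)


def bad_primary(x):
    return -1.0


def good_secondary(x):
    return x ** 0.5


def is_square_root(x, r):
    return r >= 0 and abs(r * r - x) < 1e-9


print(distributed_recovery_block(bad_primary, good_secondary, is_square_root, 9.0))
```

Because the secondary's result is usually ready by the time the primary is rejected, no rollback and re-execution delay is incurred, which is the benefit for tight deadlines noted above.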
5.5 Preconditions, Postconditions, and Assertions

Preconditions, postconditions, and assertions are forms of acceptance tests that
are widely used in software engineering to improve software reliability. The
precondition of a method (or function, or subroutine, depending on the
programming language) is a logical condition that must be true when that method
is called. For example, if we are operating in the domain of real numbers and
invoke a method to calculate the square root of a number, an obvious
precondition is that this number must be non-negative.

A postcondition associated with a method invocation is a condition that must be
true when we return from a method. For example, if a natural logarithm method
was called with input X, and the method returns Y, we must have $e^Y = X$ (within
the limits of the level of precision being used).

Preconditions and postconditions are often interpreted in contractual terms.
The function invoking a method agrees to ensure that the preconditions are met
for that method: if they are not, there is no guarantee that the invoked method will
return the correct result. In return, the method agrees to ensure that the
postconditions are satisfied upon returning from it.

Assertions are a generalization of preconditions and postconditions. An
assertion tests for a condition that must be true at the point at which that
assertion is made. For example, we know that the total node degree of an
undirected graph must be an even number (since each edge is incident on exactly
two nodes). So, we can assert at the point of computation of this quantity that
it must be even. If it turns out not to be so, an error has occurred; the
response to the failure of an assertion is usually to notify the user or carry
out some other appropriate action.

Preconditions, postconditions, and assertions are used to catch errors before
they propagate too far. The programmer has the opportunity to provide for
corrective action to be taken if these conditions are violated.
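
The three examples in this section (square root, natural logarithm, and total node degree) can be written with Python's `assert` statement. The function names, the adjacency-list representation of the graph, and the tolerance are assumptions made for this sketch. Note that Python skips `assert` statements when run with the `-O` option, so production code often uses explicit checks instead.

```python
import math


def checked_sqrt(y):
    # Precondition: in the domain of real numbers, the argument must be non-negative.
    assert y >= 0, "precondition violated: argument must be non-negative"
    return math.sqrt(y)


def checked_log(x):
    assert x > 0, "precondition violated: argument must be positive"
    y = math.log(x)
    # Postcondition: e**y must reproduce x within the precision in use.
    assert math.isclose(math.exp(y), x, rel_tol=1e-9), "postcondition violated"
    return y


def total_degree(adjacency):
    """Sum of node degrees of an undirected graph given as {node: [neighbors]}."""
    total = sum(len(neighbors) for neighbors in adjacency.values())
    # Assertion: each edge is incident on exactly two nodes, so the total is even.
    assert total % 2 == 0, "assertion violated: total degree must be even"
    return total


print(checked_sqrt(9.0))
print(checked_log(10.0))
print(total_degree({"a": ["b"], "b": ["a", "c"], "c": ["b"]}))
```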

5.6 Exception-Handling

An exception is raised to indicate that something has happened during execution
that needs attention, e.g., an assertion has been violated due to either hardware or
software failure. When an exception is raised, control is generally transferred to a
corresponding exception-handler, which is a routine that takes the appropriate
action. For example, if we have an arithmetic overflow when executing the operation
y = a*b, then the result as computed will not be correct. This fact can be signaled
as an exception, and the system must react appropriately.

Effective exception-handling can make a significant contribution to system fault
tolerance. For this reason, a substantial fraction of the code in many current
programs is devoted to exception-handling. Throughout this discussion, we will
assume that an exception is triggered in some routine that is invoked by some other
routine or by an operator external to the system.

Exceptions can be used to deal with (a) domain or range error, (b) an
out-of-the-ordinary event (not failure) that needs special attention, or (c) a
timing failure.

Domain and Range Errors 

A domain error happens when an illegal input is used. For example, if X and Y
are defined as real numbers and the operation $X = \sqrt{Y}$ is attempted with
Y = -1, a domain error will have occurred, the value of Y being illegal. On the
other hand, if X and Y are complex numbers, this operation will be perfectly legal.

A range error occurs when the program produces an output or carries out an 
operation that is seen to be incorrect in some way. Examples include the following: 

■ Reading from a file, and encountering an end-of-file while we should still 
be reading data. 

■ Producing a result that violates an acceptance test embedded within the 
program. 

■ Trying to print a line that is too long. 

■ Generating an arithmetic overflow or underflow. 

Out-of-the-Ordinary Events 

Exceptions can be used to ensure special handling of rare, but perfectly normal, 
events. For example, if we are reading a list of items from a file and the routine has 
just read the last item, it may trigger an exception to notify the invoker that this 
was the last item and that nothing further is available to be read. 

Timing Failures 

In real-time applications, tasks have deadlines associated with them. Missing a 
deadline can trigger an exception. The exception-handler then decides what to do 
in response: for instance, it may switch to a backup routine. 

5.6.1 Requirements from Exception-Handlers 

What do we look for in an exception-handling system? First, it should be easy to 
program and use. It should be modular and thus easily separable from the rest 
of the software. It should certainly not be mixed in with the other lines of code 
in a routine: that would obscure the purpose of the code and render it hard to 
understand, debug, and modify. 

Second, exception-handling should not impose a substantial overhead on the 
normal functioning of the system. We expect exceptions to be, as the term suggests, 
invoked only in exceptional circumstances: most of the time they will not be raised. 
The well-known engineering principle that the common case must be made fast 
requires that the exception-handling system not inflict too much of a burden in the 
usual case when no exceptional conditions exist.

Third, exception-handling must not compromise the system state. That is, we
must be careful not to render the system state inconsistent during exception-handling.
This is especially important in the exception-resume approach, which
we discuss in the next section.

5.6.2 Basics of Exceptions and Exception-Handling

When an exception occurs, it is said to be thrown, raised, or signaled. Some authors 
distinguish between the raising and the signaling of an exception: the former is 
when the exception notification is to the module within which it occurred; the 
latter when this notification propagates to another module. 

Internal and External Exceptions 

Exceptions can be either internal or external. An internal exception is one which is 
handled within the very same module in which it is raised. An external exception, 
on the other hand, propagates elsewhere. For example, if a module is called in a way 
that violates the specifications of its interface, an interface exception is generated, 
which has to be dealt with outside the called module. 

Propagation of Exceptions 

Figure 5.11 provides an example of exception-propagation. Here, module A calls 
module B, which executes normally until it encounters exception c. B does not 
have the handler for this exception, so it propagates the exception back to its calling
module, A, which executes the appropriate handler. If no handler can be found,
the execution is terminated. 
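The propagation pattern described above can be mimicked in a few lines of Python. This is only a sketch: the names `module_a`, `module_b`, and `SensorRangeError`, as well as the recovery action, are hypothetical and do not come from Figure 5.11.

```python
class SensorRangeError(Exception):
    """Hypothetical user-defined exception for an out-of-range reading."""


def module_b(reading):
    # B detects the problem but contains no handler for it,
    # so the exception propagates back to whoever called B.
    if not 0.0 <= reading <= 100.0:
        raise SensorRangeError(f"reading {reading} outside [0, 100]")
    return reading * 2.0


def module_a(reading):
    # A supplies the handler that B lacks.
    try:
        return module_b(reading)
    except SensorRangeError:
        return None  # illustrative recovery action chosen by A
```

If `module_a` had no `except` clause either, the exception would keep propagating outward, and with no handler anywhere the Python interpreter would terminate the program, which corresponds to the last case mentioned above.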

Automatically propagating exceptions can violate the principle of information 
hiding. Information hiding involves the separation of the interface definition of a 
routine (method, function, subroutine) from the way it is actually designed and 
implemented. The interface is public information; in contrast, the caller of the routine
does not need to know the details of the design and implementation of every
routine being called. Not only does this reduce the burden on the caller, it also 
makes it possible to improve the implementation without any changes having to 
be propagated to outside the routine. 

The invoker (the calling routine) is at a different level of abstraction from the 
invoked routine. In the example just considered, suppose that some variable X in 
the invoked routine violated its range constraint. This variable may not even be 
visible to the invoker. 

To get around this problem, we may replace automatic propagation with explicit
propagation, in which the propagated information is modified to be consonant
with scope rules. For example, if the variable X is invisible to the invoker, it
may be told that there was a violation of a range constraint within the invoked
routine. It will then have to make the best use it can of this information.

Exception-Terminate and Exception-Resume

Exceptions may be classified into the exception-terminate (ET) and exception-resume
(ER) categories. If an ET is generated while executing some module M, then execution
of M is terminated, the appropriate exception-handler is executed, and
control is returned to the routine which called module M. However, if an ER is
generated, the exception-handling routine attempts to patch up the problem and
returns control to M, which resumes execution.

Exception-terminates are much simpler to handle than are exception-resumes. 
Suppose, for example, that module A calls module B. While executing, module B 
encounters an exception. If the exception-terminate approach is taken, B will restore
its state to what it was at its invocation, signal the exception, and terminate.
Control is handed back to A. A thus has to deal only with the following two possibilities:
either B executes without exceptions and returns a result, or B encounters
an exception and terminates with its state unchanged from before it was called. 

By contrast, if the exception-resume approach is taken, B will suspend execution 
and control is transferred to the appropriate exception-handler. After the handler 
finishes its task, it has the option of returning control to B, which can then resume. 
Alternatively, the handler could send control elsewhere (it depends on the semantics
of the handler). Following this, control returns to A. Thus, when A gets back
control after a call to B, we have the following three possibilities. The first is an
exception-free execution of B, which poses no difficulties. The second is that an
exception was encountered, which was dealt with by the exception-handler, after
which control was returned to B, which resumes and finishes execution. The third
possibility is that the exception-handler transfers control elsewhere in an attempt
to handle the exception. After all this, control is handed back to A, possibly with
B being in an inconsistent state. This third possibility requires that the programmer
who wrote A knows the semantics of the exception-handler, which may not
be realistic. 
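The contrast between the two approaches can be sketched in Python. Python itself offers only termination semantics, so the resume variant below is emulated by passing a handler callback into the routine; that emulation, the `Account` class, and both function names are assumptions made purely for illustration.

```python
class Account:
    """Hypothetical module state: a balance and a log of withdrawals."""

    def __init__(self, balance):
        self.balance = balance
        self.log = []


def withdraw_terminate(account, amounts):
    """Exception-terminate: restore the invocation-time state, then signal."""
    saved_balance, saved_log = account.balance, list(account.log)
    try:
        for amount in amounts:
            if amount > account.balance:
                raise ValueError(f"insufficient funds for {amount}")
            account.balance -= amount
            account.log.append(amount)
    except ValueError:
        # Undo partial work so the caller sees the state unchanged.
        account.balance, account.log = saved_balance, saved_log
        raise


def withdraw_resume(account, amounts, handler):
    """Exception-resume (emulated): the handler patches the value and we go on."""
    for amount in amounts:
        if amount > account.balance:
            amount = handler(account, amount)  # handler returns a repaired amount
        account.balance -= amount
        account.log.append(amount)
```

With the terminate version the caller sees only two outcomes: success, or an exception with the account untouched. With the resume version the final state depends on what the handler chose to do, which is the source of the extra burden on the caller noted above.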

After the exception has been handled and control has been returned to the invoking
routine, several options are available, based on what kind of exception
occurred. 

■ Domain Error. We may choose to re-invoke, with corrected operands. If this 
is not possible, the entire computation may have to be abandoned. 

■ Range Error. There are cases in which some acceptable value may be substituted
for the incorrect one which triggered the exception, and the execution
resumed. For example, if we have an underflow, we may choose to replace
that result by 0 and carry on. If we have additional versions of the software,
we may invoke alternatives. Or we may just retry the whole operation, hoping
that it arose from some transient failure which has since gone away, or
from some combination of concurrent events that is unlikely to recur.

■ Out-of-the-Ordinary Events. These must be identified by the programmer 
and handled on a case-by-case basis. 

■ Timing Failures. If the routine is iterative, we may simply use the latest 
value. For example, if the invoked routine was searching for the optimum 
value of some function, we may decide to use the best one it has found 
so far. Alternatively, we may switch to another version of the software (if 
available) and hope that it will not suffer from the same problem. If we are 
using the software in a real-time system that is controlling some physical 
device (e.g., a valve), we may leave the setting unchanged or switch to a 
safety position. 

It is important to stress that many exceptions can only be properly dealt with in 
context: it is the context that determines what the appropriate response should be. 
For example, suppose we encounter an arithmetic overflow. In some applications, 
it may be perfectly acceptable to set the result to $\infty$ and carry on. In others, it may
not, and may require a far more involved response. 
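A short Python sketch of this context dependence follows. It relies on the fact that `math.exp` raises `OverflowError` when the result does not fit in a floating-point number; the function name `growth_factor` and the `on_overflow` policy flag are hypothetical.

```python
import math


def growth_factor(x, on_overflow="raise"):
    """Compute e**x; the reaction to overflow is chosen by the caller's context."""
    try:
        return math.exp(x)
    except OverflowError:
        if on_overflow == "saturate":
            # Acceptable where "extremely large" is all the application needs.
            return math.inf
        # Otherwise pass the exception on for a more involved response.
        raise
```

The same overflow is thus either absorbed locally or handed to the invoker, depending on a decision that only the application context can make.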

5.6.3 Language Support 

Older programming languages generally have very little built-in exception-handling
support. By contrast, more recent languages such as C++ and Java have extensive
exception-handling support. For example, in Java, the user can specify exceptions
that are thrown if certain conditions occur (such as the temperature of a
nuclear reactor exceeding a prespecified limit). Such exceptions must be caught by
an exception-handling routine, which deals with them appropriately (by raising
an alarm or printing some output).

5.7 Software Reliability Models

As opposed to the well-established analytical models of hardware reliability, the 
area of modeling error rates and software reliability is relatively young and often 
controversial. There are many models in the literature, which sometimes give rise 
to contradictory results. Our inability to accurately predict the reliability of software
is a matter of great concern, since software is often the major cause of system
unreliability.

In this section, we briefly describe three models which are a sampling of the
software reliability models available. Unfortunately, there is not yet enough evidence
to determine which model would be best for what type of software. Models
are useful in providing general guidance as to what the software quality is; they 
should not be used as the ultimate word on the actual numerical reliability of any 
piece of software. 

In what follows we distinguish between a defect (or a bug) which exists in the 
software when it is written and an error which is a deviation of the program operation
from its exact requirements (as the result of a defect) and occurs only when
the program is running (or is being tested). Once an error occurs, the bug causing
it can be corrected; however, other bugs still remain. An accepted definition of
software reliability is the probability of error-free operation of a computer program in a
specified environment for a specified time. To calculate this probability, the notion of
software error rate must be introduced. Software reliability models attempt to predict
this error rate as a function of the number of bugs in the software, and their
purpose is to determine the length of testing (and subsequent correcting) required
until the predicted future error rate of the software goes below some predetermined
threshold (and the software can be released).

All three models described next have in common the following assumptions:
The software has initially some unknown number of bugs. It is tested for a period
of time, during which time some of the bugs cause errors. Whenever an error
occurs, the bug causing it is fixed (fixing time is negligible) without causing any
additional bugs, thus reducing the number of existing bugs by one. The models
differ in their modeling of $\lambda(t)$, the software error rate at time $t$, and consequently,
in the software reliability prediction.

5.7.1 Jelinski-Moranda Model 

This model assumes that at time 0 the software has a fixed (and finite) number
$N(0)$ of bugs, out of which $N(t)$ bugs remain at time $t$. The error process is a
nonhomogeneous Poisson process, i.e., a Poisson process with a rate $\lambda(t)$ that may vary
with time. The error rate $\lambda(t)$ at time $t$ is assumed to be proportional to $N(t)$,

$$\lambda(t) = cN(t) \quad \text{(for some constant } c\text{)}$$

Note that $\lambda(t)$ in this model is a step function; it has an initial value of $\lambda_0 = \lambda(0) =
cN(0)$, decreases by $c$ whenever an error occurs and the bug that caused it is corrected,
and is constant between errors. The (testing, not including fixing) time between
consecutive errors (say $i$ and $i + 1$) is exponentially distributed with parameter
$\lambda(t)$, where $t$ is the time of the $i$th error. The reliability at time $t$, or the
probability of an error-free operation during $[0, t]$, is therefore

$$R(t) = e^{-\lambda_0 t} \tag{5.6}$$

Given an error occurred at time $\tau$, the conditional future reliability, or the conditional
probability that the following interval of length $t$, namely $[\tau, \tau + t]$, will be
error-free is

$$R(t \mid \tau) = e^{-\lambda(\tau) t} \tag{5.7}$$

As the software runs for longer and longer, more bugs are caught and purged from 
the system, and so the error rate declines and the future reliability increases. 
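The step-function behavior of the Jelinski-Moranda error rate, and its effect on the conditional reliability, can be made concrete with a few lines of Python. This is a minimal sketch: the function names and the parameter `fixed` (the number of bugs corrected so far) are our own notation, and the numerical values used to exercise it are arbitrary.

```python
import math


def jm_error_rate(c, n0, fixed):
    """Error rate c * N, where N = n0 - fixed bugs still remain."""
    remaining = n0 - fixed
    if remaining < 0:
        raise ValueError("cannot fix more bugs than were initially present")
    return c * remaining


def jm_conditional_reliability(c, n0, fixed, t):
    """Probability that the next interval of length t is error-free,
    given that `fixed` bugs have been corrected (the form of Equation 5.7)."""
    return math.exp(-jm_error_rate(c, n0, fixed) * t)
```

For instance, with $c = 0.01$ and $N(0) = 100$, calling `jm_conditional_reliability(0.01, 100, fixed, 1.0)` for increasing values of `fixed` shows the reliability rising toward 1 as bugs are removed, each correction lowering the rate by exactly $c$.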

The obvious objection to this model is that it assumes that all bugs contribute 
equally to the error rate, as expressed by the constant of proportionality c. Actually, 
not all bugs are created equal: some of them are exercised more often than others. 
Indeed, the more troublesome bugs are those that are not exercised often: these are 
extremely difficult to catch during testing. 


5.7.2 Littlewood-Verrall Model 

Similarly to the first model, this model assumes a fixed and finite number, $N(0)$, of
initial bugs, out of which $N(t)$ remain at time $t$. The difference is that this model
considers $M(t)$, the number of bugs discovered and corrected during $[0, t]$,
rather than $N(t)$ ($M(t) = N(0) - N(t)$).

The errors occur according to a nonhomogeneous Poisson process with rate $\lambda(t)$,
but $\lambda(t)$, rather than being deterministic, is considered a random variable with a
Gamma density function. The Gamma density function has two parameters, $\alpha$ and
$\psi$, where the parameter $\psi$ is a monotonically increasing function of $M(t)$:

$$f_{\lambda(t)}(\ell) = \frac{[\psi(M(t))]^{\alpha}\, \ell^{\alpha-1}\, e^{-\psi(M(t))\,\ell}}{\Gamma(\alpha)} \tag{5.8}$$

where $\Gamma(x) = \int_0^{\infty} e^{-y} y^{x-1}\, dy$ is the Gamma function (defined in Section 2.2).

The Gamma density function was chosen for practical reasons. It lends itself to
analysis, and its two parameters provide a wide range of differently shaped density
functions, making it both mathematically tractable and flexible. The expected
value of the Gamma density function in Equation 5.8 is $\alpha/\psi(M(t))$, so the predicted
error rate will decrease and the reliability will increase as the software is
run for longer periods of time and more bugs are discovered.

Calculating the reliability requires some integrations, which we omit: see the
Further Reading section for a pointer to the analysis. After such analysis, we obtain
the following expressions for the software reliability:

$$R(t) = \left[1 + \frac{t}{\psi(0)}\right]^{-\alpha} \tag{5.9}$$

$$R(t \mid \tau) = \left[1 + \frac{t}{\psi(M(\tau))}\right]^{-\alpha} \tag{5.10}$$
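The reliability expressions of this model, $[1 + t/\psi(M)]^{-\alpha}$, can be checked numerically: the reliability is the expected value of $e^{-\lambda t}$ when $\lambda$ has a Gamma density with shape $\alpha$ and rate $\psi$, and that expectation evaluates to $(\psi/(\psi + t))^{\alpha}$. The Python sketch below compares the closed form with a Monte Carlo average. The linear form of $\psi(M)$ and all numerical values are assumptions chosen only for illustration; the model itself requires only that $\psi$ be increasing in $M$.

```python
import math
import random


def psi_linear(m, psi0=50.0, psi1=5.0):
    """Assumed form of psi(M); any monotonically increasing function would do."""
    return psi0 + psi1 * m


def lv_reliability(t, m, alpha=2.0, psi=psi_linear):
    """Closed form [1 + t / psi(M)] ** (-alpha)."""
    return (1.0 + t / psi(m)) ** (-alpha)


def lv_reliability_mc(t, m, alpha=2.0, psi=psi_linear, samples=100_000, seed=1):
    """Monte Carlo estimate: average exp(-lambda * t) over Gamma-distributed lambda."""
    rng = random.Random(seed)
    scale = 1.0 / psi(m)  # gammavariate takes shape and *scale* (= 1 / rate)
    total = sum(math.exp(-rng.gammavariate(alpha, scale) * t) for _ in range(samples))
    return total / samples
```

The two functions should agree to within sampling error, and increasing `m` (more bugs corrected) raises the predicted reliability, as the text argues.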


5.7.3 Musa-Okumoto Model 

This model assumes an infinite (or at least very large) number of initial bugs in
the software, and similarly to the previous model, uses $M(t)$, the number of bugs
discovered and corrected during time $[0, t]$. We use the following notation:

$\lambda_0$   the error rate at time 0
$c$   a constant of proportionality
$\mu(t)$   the expected number of errors experienced during $[0, t]$ ($\mu(t) = E(M(t))$)

Under this model, the error rate after testing for a length of time $t$ is given by

$$\lambda(t) = \lambda_0 e^{-c\mu(t)}$$

The intuitive basis for this model is that, when testing first starts, the "easiest" 
bugs are caught quite quickly. After these have been eliminated, the bugs that still
remain are more difficult to catch, either because they are harder to exercise or 
because their effects get masked by subsequent computations. As a result, the rate 
at which an as-yet-undiscovered bug causes errors drops exponentially as testing 
proceeds. 

From the definition of $\lambda(t)$ and $\mu(t)$ we have

$$\frac{d\mu(t)}{dt} = \lambda(t) = \lambda_0 e^{-c\mu(t)}$$

The solution of this differential equation is

$$\mu(t) = \frac{\ln(\lambda_0 c t + 1)}{c}$$

and

$$\lambda(t) = \frac{\lambda_0}{\lambda_0 c t + 1}$$

The reliability $R(t)$ can now be calculated as

$$R(t) = e^{-\int_0^t \lambda(z)\,dz} = e^{-\mu(t)} = (1 + \lambda_0 c t)^{-1/c}$$

and the conditional reliability $R(t \mid \tau)$ is

$$R(t \mid \tau) = e^{-\int_{\tau}^{\tau+t} \lambda(z)\,dz} = e^{-(\mu(\tau+t) - \mu(\tau))} = \left(1 + \frac{\lambda_0 c t}{1 + \lambda_0 c \tau}\right)^{-1/c}$$

FIGURE 5.12 Error rates according to the Musa-Okumoto model. (a) Dependence on $\lambda_0$. (b) Dependence on $c$ ($\lambda_0 = 1$; curve labels indicate $c$).
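The closed-form Musa-Okumoto expressions, $\mu(t) = \ln(\lambda_0 c t + 1)/c$ and $\lambda(t) = \lambda_0/(\lambda_0 c t + 1)$, are easy to evaluate and to sanity-check against the defining relation $\lambda(t) = \lambda_0 e^{-c\mu(t)}$. The following Python sketch does so; the function names are ours, and any parameter values supplied to them are arbitrary.

```python
import math


def mo_mu(t, lam0, c):
    """Expected number of errors experienced during [0, t]."""
    return math.log(lam0 * c * t + 1.0) / c


def mo_rate(t, lam0, c):
    """Error rate after testing for a length of time t."""
    return lam0 / (lam0 * c * t + 1.0)


def mo_reliability(t, lam0, c, tau=0.0):
    """R(t | tau) = exp(-(mu(tau + t) - mu(tau))); tau = 0 gives R(t)."""
    return math.exp(-(mo_mu(tau + t, lam0, c) - mo_mu(tau, lam0, c)))
```

Evaluating `mo_rate` over a range of `t` reproduces the slow, hyperbolic decay of the error rate: halving the rate from its initial value requires testing for $t = 1/(\lambda_0 c)$, and each further halving takes twice as long as the one before.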

In Figure 5.12, we show how the error rate varies with time for the Musa-Okumoto
model. Note the very slow decay of the error rate. To get the error rate of
software down to a sufficiently low point (following this model) clearly requires a
significant amount of testing.

5.7.4 Model Selection and Parameter Estimation

The literature on software error models is vast and varied. In the previous section,
we have only outlined a very small subset of the existing models. Anyone
planning to use one of these models has two problems. First, which of the many 
available models would be appropriate? Second, how are the model parameters to 
be estimated? 

Selecting the appropriate model is not easy. The American Institute of Aeronautics
and Astronautics (AIAA) recommends using one of the following four models,
three of which we covered in this chapter: the Jelinski-Moranda, Littlewood-Verrall,
Musa-Okumoto, and Schneidewind models. However, as mentioned earlier,
no comprehensive and openly accessible body of experimental data is available
to guide the user. This is in sharp contrast to hardware reliability modeling,
where a systematic data collection effort formed the basis for much of the theory.
Software reliability models are based on plausibility arguments. The best that one
can suggest is to study the error rate as a function of testing and guess which
model it follows. For example, if the error rate seems to exhibit an exponential
dependence on the testing time, then we may consider using the Musa-Okumoto
model. Once a suitable model is selected, the parameters can be estimated by using
the Maximum Likelihood method, which is outlined in Chapter 10. Chapter 10
also outlines the difficulty in accurately predicting the reliability of a highly reliable
system (whether due to hardware failures or to software errors).
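To give a flavor of what parameter estimation involves, here is a hedged Python sketch of maximum likelihood fitting for the Jelinski-Moranda model. It assumes that the observed data are the inter-error testing times $x_1, \ldots, x_n$ and that the $i$th of these is exponentially distributed with rate $c\,(N(0) - i + 1)$, as that model postulates. For a fixed $N(0)$, setting the derivative of the log-likelihood with respect to $c$ to zero gives $\hat{c} = n / \sum_i (N(0) - i + 1)\, x_i$; the sketch then simply scans integer candidates for $N(0)$. The function names and the upper limit `n0_max` are our own choices, and this is not the procedure of Chapter 10.

```python
import math


def jm_c_hat(times, n0):
    """Maximum likelihood estimate of c for a given initial bug count n0."""
    weighted = sum((n0 - i) * x for i, x in enumerate(times))
    return len(times) / weighted


def jm_log_likelihood(times, n0, c):
    """Log-likelihood of the inter-error times under the Jelinski-Moranda model."""
    return sum(math.log(c * (n0 - i)) - c * (n0 - i) * x
               for i, x in enumerate(times))


def jm_fit(times, n0_max):
    """Scan integer n0 from len(times) to n0_max; return the best (n0, c, loglik)."""
    best = None
    for n0 in range(len(times), n0_max + 1):
        c = jm_c_hat(times, n0)
        loglik = jm_log_likelihood(times, n0, c)
        if best is None or loglik > best[2]:
            best = (n0, c, loglik)
    return best
```

When the data show little reliability growth, the best `n0` tends to land on the upper limit of the scan, which is one symptom of the estimation difficulties mentioned above.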


5.8 Fault-Tolerant Remote Procedure Calls 

A Remote Procedure Call (RPC) is a mechanism by which one process can call 
another process executing on some other processor. RPCs are widely used in distributed
computing.

We will describe next two ways of making RPCs fault tolerant: both are based 
on replication and bear similarities to the problem of managing replicated data 
(see Section 3.3). Throughout, we will assume that processes are fail-stop. 

5.8.1 Primary-Backup Approach 

Each process is implemented as primary and backup processes, running on separate
nodes. RPCs are sent to both copies, but normally only the primary executes
them. If the primary should fail, the secondary is activated and completes the
execution.

The actual implementation of this approach depends on whether the RPCs are 
retryable or nonretryable. A retryable RPC is one which can be executed multiple 
times without violating correctness. One example is the reading of some database. 
A nonretryable RPC should be completed exactly once. For example, incrementing 
somebody's bank balance is a nonretryable operation.

FIGURE 5.13 Example of a circus.

If the system is running only retryable operations, then implementation of the
primary-backup approach is quite straightforward. On the other hand, if nonretryable
operations may be involved, it is important to ensure that these be completed
exactly once, even if multiple processes are used for fault tolerance. This
can be done by the primary process checkpointing its operations on the backup. 
Should the primary fail while executing the RPC, the backup can pick up from the 
last checkpoint (see Chapter 6). 
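The following Python sketch illustrates one way in which checkpointing on the backup can give exactly-once behavior for a nonretryable deposit. It is a toy model under stated assumptions: both copies live in one address space, each RPC carries a unique identifier `rpc_id`, the checkpoint copies the whole state, and a fail-stop failure of the primary is represented by an exception. None of these names or choices comes from a real RPC library.

```python
class Replica:
    """State of one copy of a hypothetical bank-account server."""

    def __init__(self):
        self.balance = 0
        self.done = set()  # identifiers of RPCs already applied to this copy

    def apply(self, rpc_id, amount):
        """Apply a deposit at most once per RPC identifier."""
        if rpc_id not in self.done:
            self.balance += amount
            self.done.add(rpc_id)
        return self.balance


def primary_deposit(primary, backup, rpc_id, amount, crash_before_checkpoint=False):
    """The primary executes the RPC, then checkpoints its state on the backup."""
    primary.apply(rpc_id, amount)
    if crash_before_checkpoint:
        raise RuntimeError("primary failed (fail-stop)")
    backup.balance = primary.balance   # checkpoint: copy the state ...
    backup.done = set(primary.done)    # ... and the record of completed RPCs


def backup_take_over(backup, rpc_id, amount):
    """After a primary failure, redo only what the last checkpoint does not cover."""
    return backup.apply(rpc_id, amount)
```

If the primary fails before checkpointing a deposit, the backup executes it once from its last checkpoint; if the deposit had already been checkpointed, the recorded identifier prevents it from being applied a second time.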

5.8.2 The Circus Approach 

The circus approach also involves the replication of processes. Client and server 
processes are each replicated. Continuing the circus metaphor, these replicated 
sets are called troupes. 

This system is best described through an example. Figure 5.13 shows four replicates
of a client process, making identical calls to four replicates of a server process.
Each call has a sequence number associated with it, that uniquely identifies it. 

A server waits until it has received identical calls from each of the four client 
copies, or the waiting times out, before executing the RPC. The results are then sent 
back to each of the clients. These replies are also marked by a sequence number to 
uniquely identify them. 

A client may wait until receiving identical replies from each of the server copies
before accepting the input (subject to a timeout to prevent it from waiting forever
for a failed server process). Alternatively, it could simply take the first reply it gets
and ignore the rest.

An additional complication must be taken care of: it is possible for multiple 
client troupes to be sending concurrent calls to the same server troupe. In such a 
case, each member of the server troupe must, to ensure correct functioning, serve 
the calls in exactly the same order. 

There are two ways of ensuring that this order is preserved, called the optimistic
and pessimistic approaches. In the optimistic approach, we make no special
attempt to ensure preservation of the order. Instead, we let everything run freely
and then check to see if they preserved order. If so, we accept the outputs; otherwise,
we abort the operations and try again. This approach will perform very
poorly if ordering is frequently not preserved.

The pessimistic approach, on the other hand, has built-in mechanisms which 
ensure that order is preserved. 

Let us now present a simple optimistic scheme. Each member of the server
troupe receives requests from one or more client troupes. When a member completes
processing and is ready to commit, it sends a ready_to_commit message
to each element of the client troupe. It then waits until every member of the
client troupe acknowledges this call, before proceeding to commit. On the client
side, a similar procedure is followed: the client waits until it has received the
ready_to_commit message from every member of the server troupe, before acknowledging
the call. Once the server receives an acknowledgment from each member
of the client troupe, it commits.

This approach ensures correct functioning by forcing deadlock if the serial order
is violated. For example, let $C_1$ and $C_2$ be two client troupes making concurrent
RPCs $p_1$ and $p_2$ to a server troupe consisting of servers $S_1$ and $S_2$. Let us see what
happens if $S_1$ tries to commit $p_1$ first and then $p_2$, while $S_2$ works in the opposite
order.

Once $S_1$ is ready to commit $p_1$, it sends a ready_to_commit message to each member
of $C_1$, and waits to receive an acknowledgment from each of them. Similarly, $S_2$
gets ready to commit $p_2$, and sends a ready_to_commit message to each member of
$C_2$. Now, members of each client troupe will wait until hearing a ready_to_commit
from both $S_1$ and $S_2$. Since members of $C_1$ will not hear from $S_2$ and members of
$C_2$ will not hear from $S_1$, there is a deadlock. Algorithms exist to detect such deadlocks
in distributed systems. Once the deadlock is detected, the operations can be
aborted before being committed, and then retried.
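The deadlock argument can be replayed with a very small Python model. The sketch abstracts away all messages: it assumes every server in the troupe receives the same set of calls, that a server announces ready_to_commit only for the call at the head of its own order, and that a call commits only when every server has announced it. The function name `run_optimistic` and the dictionary representation of the commit orders are our own.

```python
def run_optimistic(orders):
    """Replay the optimistic scheme for the given per-server commit orders.

    `orders` maps each server name to the list of calls in the order that
    server would commit them. Returns (committed_calls, deadlocked).
    """
    queues = {server: list(calls) for server, calls in orders.items()}
    committed = []
    while any(queues.values()):
        if not all(queues.values()):
            # Some server has nothing left to announce while others still wait.
            return committed, True
        heads = {calls[0] for calls in queues.values()}
        if len(heads) > 1:
            # Servers are ready for different calls: no client troupe hears from
            # every server, so nobody acknowledges and nobody can commit.
            return committed, True
        call = heads.pop()
        for calls in queues.values():
            calls.pop(0)
        committed.append(call)
    return committed, False
```

Calling it with `{"S1": ["p1", "p2"], "S2": ["p2", "p1"]}` reports a deadlock with nothing committed, matching the scenario above, whereas identical orders on both servers let every call commit.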


5.9 Further Reading 

An excellent introduction to the intrinsic difficulties of making software run correctly
can be found in [7,8]. A contrary view, arguing that complexity can be successfully
encapsulated in software modules to render it invisible to users of those
modules (human or software routines that call these modules) is presented in [12].
[28] is regarded as a classic in the field of software safety. Other excellent, general,
references for software fault tolerance are [20,43].

Wrappers are motivated in [46]. Systematic design procedures for wrappers are 
discussed in [38,39]. In [41], the authors describe how to wrap a kernel. In [15], 
wrappers are used to prevent heap-smashing attacks. Finally, [19] describes the 
wrapping of Windows NT software. 

Software rejuvenation has a long history. People were rebooting their computers
when they failed or hung long before it was called rejuvenation. However, its
formal use to enhance software reliability is fairly new. A good introduction can
be found in [17,21]. The use of software rejuvenation, including a tool to implement
it, is described in [10]. A method by which to estimate the rate of software
aging (and hence to determine when to rejuvenate) is provided in [18]. The application
of rejuvenation to cluster systems, including a discussion of the relative
merits of time-based and prediction-based approaches, can be found in [44], and 
the smoothing approach they use for prediction was proposed in [11]. 

Data diversity is described in some detail in [1] where experimental results are 
provided of a radar tracking application. A good reference to SIHFT techniques 
which also includes a detailed overview of related schemes appears in [36]. The 
IEEE floating-point number representation and the precision of floating-point operations
are discussed in many books, e.g., [25]. The RESO technique is described
in [37].

A good introduction to N-version programming can be found in [3]. A design 
paradigm is provided in [2]. The observation that requirements specifications are
the cause of most software bugs is stated in multiple places, for example, see
[5,28,45].

A survey of modeling software design diversity can be found in [32]. This chapter
draws on the foundational work of the authors of [14]. An excellent description
of the ways by which forced diversity among versions can be obtained can be
found in [29]. An experiment in design diversity is described in [6]. Experiments
to determine if the versions are stochastically independent or not have not been
without some controversy: see [26] regarding some early experiments in this field.
A recent study of the impact of specification language diversity on the diversity of
the resulting software was published in [49]. An investigation into whether the results
from different inspectors of software are independent of one another appears
in [34]. 

The cost of creating multiple versions has not been systematically surveyed: an 
interesting initial study appears in [24]. 

An introduction to exceptions and exception-handling can be found in [13], 
and a useful survey can be found in [9]. A discussion of the comparative merits 
of exception-terminate and exception-resume is in [33]. The exception-handling 
mechanism in individual languages is generally treated in some detail in their
language manuals and books: for example, see [4]. Exception-handling in object-oriented
systems is surveyed in [16] and in real-time software in [27]. A good outline
of issues in distributed systems can be found in [48].

There is a substantial literature on software reliability models (also called software
reliability growth models). A useful survey can be found in [47]. A recent
paper which uses a Bayesian approach is [40]. The three models discussed in this
chapter have been presented in [23,30,31,35].

A good discussion of fault-tolerant remote procedure calls can be found in [22].


5.10 Exercises 


1. Join $N - 1$ other people (N = 3, 4, 5 are probably the best choices) to carry out
an experiment in writing N-version programs. Write software to solve a set of 
differential equations, using the Runge-Kutta method. Programs for doing so 
are widely available and can be used to check the correctness of each version 
produced. Pick one of these, and compare the output of each version against 
this reference program for a large number of test cases. Identify the number of 
bugs in each version and the extent to which they are correlated. 

2. The correct output, z, of some system has as its probability density function the
truncated exponential function (assume L is some positive constant):

$$f(z) = \begin{cases} \dfrac{\mu e^{-\mu z}}{1 - e^{-\mu L}} & \text{if } 0 \le z \le L \\ 0 & \text{otherwise} \end{cases}$$

If the program fails, it may output any value over the interval [0, L] with equal
probability. The probability that the program fails on some input is q.

The penalty for putting out an incorrect value is $\pi_{bad}$; the penalty for not producing
any output at all is $\pi_{stop}$.

We want to set up an acceptance test in the form of a range check, which rejects 
outputs outside the range [0, a]. Compute the value of a for which the expected 
total penalty is minimized. 

3. In this problem, we will use simulation to study the performance of a bug
removal process after an initial debugging (see Chapter 10).

Assume you have a program that has N possible inputs. There are k bugs in the
program, and bug i is activated whenever any input in the set $F_i$ is applied. It is
not required that $F_i \cap F_j = \emptyset$, and so the bug sets may overlap. That is, the same
input may trigger multiple bugs. If $F_i$ has k elements in it, the elements of $F_i$
are obtained by randomly drawing k different elements from the set of possible
inputs.

Assume that you have a test procedure that applies inputs to the program.
These inputs are randomly chosen from among the ones that have not been
applied so far. Also assume that when an input is applied which triggers one
or more bugs, those bugs are immediately removed from the program.

Plot the number of software bugs remaining in the program as a function of the
number of inputs applied. Use the following parameters in your simulation:

i. $N = 10^8$, and the number of elements in $F_i$ is uniformly distributed over the
set $\{1, 2, \ldots, n\}$.

a. The total number of bugs is b = 1000, n = 50, and a total of $10^6$ randomly
chosen test inputs are applied.

b. Repeat (a) for n = 75.

c. Repeat (a) for n = 100.

ii. $N = 10^8$, and the number of elements in $F_i$ has the following probability
mass function:

$$\text{Prob}\{F_i \text{ has } k \text{ elements}\} = \frac{p(1-p)^{k-1}}{1 - (1-p)^n}, \quad \text{where } k = 1, \ldots, n$$

Apply $10^6$ randomly chosen test vectors in all. As before, assume there are
b = 1000 software bugs.

a. n = 50, p = 0.1.

b. n = 75, p = 0.1.

c. n = 100, p = 0.1.

d. Repeat (a) to (c) for p = 0.2.

e. Repeat (a) to (c) for p = 0.3.

Discuss your results. 

4. In this problem, we will use Bayes' law to provide some indication of whether
bugs still remain in the system after a certain amount of testing. Suppose you
are given that the probability of uncovering a bug (given that at least one exists)
after t seconds of testing is $1 - e^{-\mu t}$. Your belief at the beginning of testing is
that the probability of having at least one bug is q. (Equivalently, you think
that the probability that the program was completely bug-free is $p = 1 - q$.)
After t seconds of testing, you fail to find any bugs at all. Bayes' law gives us
a concrete way in which to use this information to refine your estimate of the
chance that the software is bug-free: find the probability that the software is
actually bug-free, given that you have observed no bugs at all, despite t seconds of
testing.

Let us use the following notation:

■ A is the event that the software is actually bug-free.

■ B is the event that no bugs were caught despite t seconds of testing.

a. Show that

$$\text{Prob}\{A \mid B\} = \frac{p}{p + q e^{-\mu t}}$$

b. Fix p = 0.1, and plot curves of Prob{A|B} against t for the following parameter
values: $\mu = 0.001, 0.01, 0.1$; $0 < t < 10{,}000$.

c. Fix $\mu = 0.01$ and plot curves of Prob{A|B} against t for the following parameter
values: p = 0.1, 0.3, 0.5.

d. What conclusions do you draw from your plots in (b) and (c) above?

5. Based on the expressions for sensitivity and specificity presented in Section 5.4,
derive an expression for the probability of a false alarm (in a single stage). 

6. In the context of the SIHFT technique, the term data integrity has been defined 
as the probability that the original and the transformed programs will not both 
generate identical incorrect results. Show that if the only faults possible are 
single stuck-at faults in a bus (see Figure 5.4) and k is either —1 or 2' with 
l an integer, then the data integrity is equal to 1. Give an example when the 
data integrity will be smaller than 1. (Hint: Consider ripple-carry addition with 
*=-!•) 

7. Compare the use of the AN code (see Chapter 3) to the RESO technique. Consider
the types of faults that can be detected and the overheads involved in
both cases.

Computers today are thousands of times faster than they were just a few
decades ago. Despite this, many important applications take days or more of
computer time. Indeed, as computing speeds increase, computational problems that
were previously dismissed as intractable become practical. Here are some
applications that take a very long time to execute, even on today's fastest computers.

1. Fluid-Flow Simulation. Many important physics applications require the
simulation of fluid flows. These are notoriously complex, consisting of large
assemblages of three-dimensional cells interacting with one another. Examples
include weather and climate modeling.

2. Optimization. Optimally deploying resources is often very complex. For example,
airlines must schedule the movement of aircraft and their crews so that
the correct combination of crews and aircraft is available, with all the regulatory
constraints (such as flight crew rest hours, aircraft maintenance, and the
aircraft types that individual pilots are certified for) satisfied.

3. Astronomy. N-body simulations that account for the mutual gravitational
interactions of N bodies, the formation of stars during the merger of galaxies,
the dynamics of galactic cluster formation and the hydrodynamic modeling of
the universe are problems that can require huge amounts of time on even the
fastest computers.

4. Biochemistry. The study of protein folding holds the potential for tailoring
treatments to an individual patient's genetic makeup and disease. This problem is
sufficiently complex to require petaflops of computing power.

When a program takes very long to execute, both the probability of failure during
execution and the cost of such a failure become significant.

To illustrate this problem, we introduce the following analytical model, which
we will use throughout this chapter. Consider a program that takes $T$ time units to
execute if no failures occur during its execution. Suppose the system suffers
transient failures according to a Poisson process with a rate of $\lambda$ failures per time unit.
Here, to simplify the derivation, we assume that transients are point failures, i.e.,
they induce an error in the system and then go away. All the computation done
by the program prior to the error is lost; the system takes negligible time to
recover from the failure. Some of these simplifying assumptions are removed in
Section 6.3.

Let $E$ be the expected execution time, including any computational work lost
to failure. To calculate $E$, we follow standard conditioning arguments. We list all
the possible cases, systematically work through each one, weight each case by its
probability of occurrence, and sum them all up to get the overall expected
execution time.

It is convenient to break the problem down into two cases. Either (Case 1) there
are no failures during the execution or (Case 2) there is at least one. If there are no
failures during execution, the execution time is (by definition) $T$. The probability
of no failures happening over an interval of duration $T$ is $e^{-\lambda T}$, so the contribution
of Case 1 to the average execution time is $T e^{-\lambda T}$.

If failure does occur, things get a bit more complicated. Suppose that the first
failure to hit the execution occurs $\tau$ time units into the execution time $T$. Then,
we have lost these $\tau$ units of work and will have to start all over again. In such
an event, the expected execution time will be $\tau + E$. The probability that the first
failure falls in the infinitesimal interval $[\tau, \tau + d\tau]$ is given by $\lambda e^{-\lambda\tau}\,d\tau$.

$\tau$ may be anywhere in the range $[0, T]$. We remove the conditioning on $\tau$ to
obtain the contribution of Case 2 to the average execution time:

$$\int_{\tau=0}^{T} (\tau + E)\,\lambda e^{-\lambda\tau}\,d\tau = \frac{1}{\lambda} + E - e^{-\lambda T}\left(\frac{1}{\lambda} + T + E\right)$$

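Adding the contributions of Cases 1 and 2, we have

$$E = T e^{-\lambda T} + \frac{1}{\lambda} + E - e^{-\lambda T}\left(\frac{1}{\lambda} + T + E\right) \qquad (6.1)$$

Solving this equation for $E$, we obtain the (surprisingly simple) expression:

$$E = \frac{e^{\lambda T} - 1}{\lambda} \qquad (6.2)$$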


We can see that the average execution time $E$ is very sensitive to $T$; indeed, it
increases exponentially with $T$. The penalty imposed by the failure process can
be measured by $E - T$ (the extra time wasted due to failures). When normalizing
$E - T$ by the failure-free execution time $T$, we obtain $\eta$, a dimensionless metric of
this penalty:

$$\eta = \frac{E - T}{T} = \frac{E}{T} - 1 = \frac{e^{\lambda T} - 1}{\lambda T} - 1 \qquad (6.3)$$

Note that $\eta$ depends only on the product $\lambda T$, the number of failures expected to
strike the processor over the duration of an execution.

FIGURE 6.1 $\eta$ as a function of the expected number of failures.

Figure 6.1 plots $\eta$ as a function of $\lambda T$, showing that $\eta$ starts quite small, but then
goes up rapidly.
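
To make this growth concrete, the sketch below evaluates Equation 6.3 at a few values of $\lambda T$. The sample points are chosen here purely for illustration, and the results are rounded to three significant figures.

```latex
\begin{align*}
\lambda T = 0.1:&\quad \eta = \frac{e^{0.1} - 1}{0.1} - 1 \approx 0.0517 \\
\lambda T = 1:&\quad \eta = \frac{e^{1} - 1}{1} - 1 \approx 0.718 \\
\lambda T = 2:&\quad \eta = \frac{e^{2} - 1}{2} - 1 \approx 2.19 \\
\lambda T = 5:&\quad \eta = \frac{e^{5} - 1}{5} - 1 \approx 28.5
\end{align*}
```

For small $\lambda T$, expanding the exponential gives $\eta \approx \lambda T/2$. A program expected to see one failure per ten runs therefore wastes about 5% of its time, whereas one expected to see five failures per run takes nearly 30 times its failure-free execution time.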


6.1 What Is Checkpointing? 

Let us start with an example to which almost everyone who has balanced a
checkbook can relate. We have a long list of numbers to add up using a hand calculator.
As we do the addition, we record on a slip of paper the partial sum so far for, 
say, every five additions. Suppose we hit the Clear button by mistake after adding 
up the first seven numbers. To recover, all we need to do is to add to the partial 
sum recorded after five additions, the sixth and seventh terms (see Figure 6.2). We 
have been saved the labor of redoing all six additions: only two need to be done 
to recover. This is the principle of checkpointing: the partial sums are checkpoints. 
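
The calculator example can be sketched in C as follows. This is a minimal illustration, not code from the text: the names `struct checkpoint`, `add_from_checkpoint`, and `CKPT_EVERY` are hypothetical, and the amounts of Figure 6.2 are stored in cents so that the integer sums are exact.

```c
#include <stddef.h>

/* Amounts from Figure 6.2, held in cents so that integer sums are exact. */
static const long amounts[] = {2351, 41478, 14720, 11000, 32668, 5000,
                               21500, 34896, 389, 455, 72595};
#define N_ITEMS    (sizeof amounts / sizeof amounts[0])
#define CKPT_EVERY 5   /* record the partial sum after every fifth item */

/* The "slip of paper": how many items the recorded partial sum covers. */
struct checkpoint {
    size_t items_done;
    long   partial_sum;
};

/* Resume from *ckpt and add items up to (not including) index `upto`,
   refreshing the checkpoint after every CKPT_EVERY-th item.
   Returns the running sum in cents. */
long add_from_checkpoint(struct checkpoint *ckpt, size_t upto)
{
    long sum = ckpt->partial_sum;
    for (size_t i = ckpt->items_done; i < upto && i < N_ITEMS; i++) {
        sum += amounts[i];
        if ((i + 1) % CKPT_EVERY == 0) {
            ckpt->items_done = i + 1;
            ckpt->partial_sum = sum;
        }
    }
    return sum;
}
```

If the running sum is lost after the seventh item, calling `add_from_checkpoint` again with the same checkpoint re-adds only the sixth and seventh amounts, starting from the partial sum recorded after the fifth.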

In general, a checkpoint is a snapshot of the entire state of the process at the
moment it was taken. It represents all the information that we would need to restart
the process from that point. We record the checkpoint on stable storage, i.e., storage
in whose reliability we have sufficient confidence. Disks are the most commonly
used medium of stable storage: they can hold data even if the power supply is
interrupted (so long as there is no physical damage to the disk surface), and
enormous quantities of data can be stored very cheaply. This is important because a
checkpoint can be very large: tens or hundreds of megabytes (or more) is not
uncommon.

Occasionally, standard memory (RAM) that is rendered (relatively) nonvolatile 
by the use of a battery backup is also used as stable storage. When choosing a 
stable storage medium, it is important to keep in mind that nothing is perfectly 
reliable. When we use a particular device as stable storage, we are making the 
judgment that its reliability is sufficiently high for the application at hand. 

Two more terms are worth defining at this point. Taking a checkpoint increases
the application execution time: this increase is defined as the checkpoint overhead.

Item number    Amount    Checkpoint
     1          23.51
     2         414.78
     3         147.20
     4         110.00
     5         326.68      1022.17
     6          50.00
     7         215.00
     8         348.96
     9           3.89
    10           4.55      1644.57
    11         725.95

FIGURE 6.2 Example of checkpointing.


Checkpoint latency is the time needed to save the checkpoint. In a very simple
system, the overhead and latency are identical. However, in systems that permit some
part of the checkpointing operation to be overlapped with application execution, 
the latency may be substantially greater than the overhead. For example, suppose 
a process checkpoints by writing its state into an internal buffer. Having done so, 
the CPU continues executing the process, while another unit handles writing out 
the checkpoint from the buffer to disk. Once this is done, the checkpoint has been 
stored and is available for use in the event of failure. 

The checkpointing latency obviously depends on the checkpoint size. This can
vary from program to program, as well as with time, during the execution of a
single program. For example, consider the following contrived piece of C code:

    for (i = 0; i < 1000000; i++)
        if (f(i) < min) {min = f(i); imin = i;}
    for (i = 0; i < 100; i++) {
        for (j = 0; j < 100; j++) {
            c[i][j] += i*j/min;
        }
    }

This program fragment consists of two easily distinguishable portions. In the first,
we compute the smallest value of f(i) for 0 ≤ i < 1,000,000, where f() is some
function specified in the program. In the second portion, we update every element of
the 100 × 100 array c, using a multiplication followed by a division.

A checkpoint taken when the program is executing the first portion need not
be large. In fact, all we need to record are the program counter and the variables
min and imin. (The system will usually record all the registers, but most of them
will not actually be relevant here.) A checkpoint taken when the second portion
is being executed must include the array c[i][j] as it has been computed so
far.

The size of the checkpoint is therefore program-dependent. It may be as small
as a few kilobytes or as large as several gigabytes.
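
The difference between the two portions can be made tangible by writing down the state each one would need to save. The sketch below is only an illustration: the struct names and the types (`long`, `double`) are assumptions, since the fragment above does not declare its variables, and the loop indices stand in for the program counter.

```c
#include <stddef.h>

/* State worth saving while the first loop runs: the loop index (standing
   in for the program counter) and the two live variables. */
struct phase1_checkpoint {
    long   i;
    double min;
    long   imin;
};

/* While the second loop nest runs, the partially updated array must be
   saved as well, together with both loop indices. */
struct phase2_checkpoint {
    long   i, j;
    double min;
    double c[100][100];
};

/* Size in bytes of the state to save in portion 1 or portion 2. */
size_t checkpoint_size(int portion)
{
    return portion == 1 ? sizeof(struct phase1_checkpoint)
                        : sizeof(struct phase2_checkpoint);
}
```

Under these assumptions the second checkpoint is dominated by the 10,000-element array and is thousands of times larger than the first.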

6.1.1 Why Is Checkpointing Nontrivial? 

From the preceding discussion, the reader may be wondering why checkpointing 
merits a full chapter in this book. Surely the concept as outlined above is quite 
trivial. Unfortunately, in checkpointing (as in so much else), the devil is in the 
detail. Here are some of the issues that arise: 

1. At what level (user or kernel) should we checkpoint: what are the pros and
cons of each level? How transparent to the user should the checkpointing
process be?

2. How many checkpoints should we have?

3. At which points during the execution of a program should we checkpoint?

4. How can we reduce checkpointing overhead?

5. How do we checkpoint distributed systems in which there may or may not be
a central controller, and in which messages pass between individual processes?

In addition to these issues, there is the question of how to restart the
computation at a different node if that becomes necessary. A program does not exist in
isolation: it interacts with libraries and the operating system. Its page tables may
need to be adjusted to reflect any required changes to the virtual-to-physical
address translation. In other words, we have to be careful to ensure, when restarting
on processor B a task checkpointed on processor A, that the execution
environment of B is sufficiently aligned with that of A to allow this restart to proceed
correctly.

Furthermore, program interactions with the outside world should be carefully 
considered because some of them cannot be undone. For example, if the system 
has printed something, it cannot unprint it. A missile, once launched, cannot be 
unlaunched. Such outputs must therefore not be delivered before the system is 
certain that it will not have to undo them. 


6.2 Checkpoint Level


Checkpointing can be done at the kernel, user, or application level.

■ Kernel-Level Checkpointing. If checkpointing procedures are included
in the kernel, checkpointing is transparent to the user and generally no
changes are required to programs to render them checkpointable. When the
system restarts after failure, the kernel is responsible for managing the
recovery operation.

In a sense, every modern operating system takes checkpoints. When a
process is preempted, the system records the process state, so that
execution can resume from the interrupted point without loss of computational
work. However, most operating systems provide little or no checkpointing
explicitly for fault tolerance.

■ User-Level Checkpointing. In this approach, a user-level library is provided
to do the checkpointing. To checkpoint, application programs are linked to
this library. As with kernel-level checkpointing, this approach generally
requires no changes to the application code; however, explicit linking is
required with the user-level library. The user-level library also manages
recovery from failure.

■ Application-Level Checkpointing. Here, the application is responsible
for carrying out all the checkpointing functions. Code for checkpointing
and managing recovery from failure must therefore be written into the
application. This approach provides the user with the greatest control
over the checkpointing process but is expensive to implement and
debug.

Note that the information available to each level may be different. For example,
if the process consists of multiple user-level threads, the kernel is generally not
aware of them: such threads are a level of detail invisible at the kernel level. Similarly,
the user and application levels do not have access to information held at the kernel level.
Nor can they ask, upon recovery, that a recovering process be assigned a particular
process identifying number. As a result, a single program could have multiple
process identifiers over the course of its life. This may or may not be a problem,
depending on the application. Similarly, the user and application levels may not
be allowed to checkpoint parts of the file system: in such cases, we may have to
store the names and pointers to the appropriate files instead.
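
As an illustration of the application-level approach, the sketch below lets a simple summation loop save and restore its own state. Everything in it is an assumption made for the example: the file names, the `struct app_state` layout, and the `fail_at` parameter that mimics a crash. It also relies on POSIX behavior, where `rename` replaces an existing file.

```c
#include <stdio.h>

/* Hypothetical checkpoint file names; any stable-storage path would do. */
#define CKPT_FILE "sum.ckpt"
#define CKPT_TMP  "sum.ckpt.tmp"

/* Everything the application needs in order to resume the loop. */
struct app_state {
    long next_i;   /* next loop index to process */
    long sum;      /* partial result so far */
};

/* Write the state to a temporary file, then rename it over the old
   checkpoint, so a crash during the write leaves the old one intact. */
static int save_checkpoint(const struct app_state *s)
{
    FILE *fp = fopen(CKPT_TMP, "wb");
    if (!fp) return -1;
    size_t written = fwrite(s, sizeof *s, 1, fp);
    if (fclose(fp) != 0 || written != 1) return -1;
    return rename(CKPT_TMP, CKPT_FILE);
}

/* Load the latest checkpoint; start from scratch if none exists. */
static void load_checkpoint(struct app_state *s)
{
    FILE *fp = fopen(CKPT_FILE, "rb");
    s->next_i = 0;
    s->sum = 0;
    if (fp) {
        struct app_state saved;
        if (fread(&saved, sizeof saved, 1, fp) == 1) *s = saved;
        fclose(fp);
    }
}

/* Sum 0..n-1, checkpointing every `every` iterations. If fail_at >= 0,
   return early at that index to mimic a crash. Returns 1 and stores the
   result in *out when the computation finishes, 0 otherwise. */
int run(long n, long every, long fail_at, long *out)
{
    struct app_state s;
    load_checkpoint(&s);
    for (long i = s.next_i; i < n; i++) {
        if (i == fail_at) return 0;          /* simulated failure */
        s.sum += i;
        s.next_i = i + 1;
        if (s.next_i % every == 0 && save_checkpoint(&s) != 0) return 0;
    }
    *out = s.sum;
    remove(CKPT_FILE);                       /* done: discard the checkpoint */
    return 1;
}
```

A first call such as `run(1000, 100, 550, &r)` stops at index 550; a second call with `fail_at` set to -1 resumes from the checkpoint taken at index 500 rather than from zero. The cost of this control is visible in the code: the save, load, and resume logic all had to be written into the application.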


6.3 Optimal Checkpointing: An Analytical Model

We next provide a model which quantifies the impact of latency and overhead
on the optimal placement of checkpoints. We have already mentioned that in a
modern system, the checkpointing overhead may be much smaller than the
checkpointing latency. Briefly, the overhead is that part of the checkpointing activity that
is not hidden from the application; it is the part that is not done in parallel with the
application execution. Intuitively, it should be clear that the checkpointing
overhead has a much greater impact on performance than the latency.

FIGURE 6.3 Checkpointing latency and overhead (squares represent latency and the
shaded portions represent overhead).

Let us begin by introducing some notation, with the aid of Figure 6.3. The
latency, denoted by $T_{lt}$, is the time interval between when the checkpointing
operation starts (e.g., $t_0$ in the figure) and when it ends ($t_2$ in the figure). To
simplify the expressions below, we assume that this time interval is of a fixed size;
in other words, $T_{lt} = t_2 - t_0 = t_5 - t_3 = t_8 - t_6$. The three checkpoints that are
shown in Figure 6.3 represent the state of the system at $t_0$, $t_3$, and $t_6$,
respectively. The overhead, denoted by $T_{ov}$, is that part of the $T_{lt}$ interval during which
the application is blocked from executing due to the checkpointing process. Here
too, for simplicity, we assume that this is a fixed-size interval and in the figure,
$T_{ov} = t_1 - t_0 = t_4 - t_3 = t_7 - t_6$.

If a failure occurs some time during the latency interval $T_{lt}$, we assume that
the checkpoint being taken is useless and that the system must roll back to the
previous checkpoint. For example, if a failure occurs anytime in the interval $[t_3, t_5]$
in Figure 6.3, we have to roll back to the preceding checkpoint that contains the
state of the process at time $t_0$.

In the previous simpler model, we assumed that recovery from failure was
instantaneous. Here, we make the more realistic assumption that the recovery time
has an average of $T_r$. That is, if a transient failure hits a process at time $\tau$, the
process becomes active again only at an expected time of $\tau + T_r$. This recovery
time includes the time spent in a faulty state plus the time it takes to recover to a
functional state (e.g., the time it takes to complete rebooting the processor).

Let us consider the interval of time between when the $i$th checkpoint has been
completed (and is ready to be used, if necessary) and when the $(i+1)$st checkpoint
is complete, and denote its expected value by $E_{int}$. Let $T_{ex}$ be the amount of
time spent executing the application over this period. That is, if $N$ checkpoints are
placed uniformly through the program's execution time $T$, then $T_{ex} = T/(N + 1)$.
Thus, if the interval is unaffected by failure, its length will be equal to $T_{ex} + T_{ov}$.

What happens if there is a failure $\tau$ units into the interval $T_{ex} + T_{ov}$? First, we
lose all the work that was done after the preceding checkpoint was taken. That
work is the union of (a) the useful work done during the latency period, which is
equal to $T_{lt} - T_{ov}$, and (b) the work done since the interval began. The lost work is
thus given by $\tau + T_{lt} - T_{ov}$.

Second, it takes an average time of $T_r$ units to recover from this failure and
restart computations. Hence, the total amount of extra time due to a failure that
occurs $\tau$ time units into the interval is $\tau + T_{lt} - T_{ov} + T_r$.

6.3.1 Time Between Checkpoints: A First-Order Approximation

In the first-order approximation, we assume that at most one failure strikes the 
system between successive checkpoints. To calculate the expected time between 
two successive checkpoints, we follow the same conditioning strategy as before: 
we look at two cases, find the contribution of each case to the expected time, and 
add up the weighted contributions. 

Case 1 involves no failure between successive checkpoints. Since the
intercheckpoint interval is $T_{ex} + T_{ov}$, the probability of Case 1 is $e^{-\lambda(T_{ex}+T_{ov})}$ and the
contribution of Case 1 to the expected interval length is

$$(T_{ex} + T_{ov})\,e^{-\lambda(T_{ex}+T_{ov})}$$

Case 2 involves one failure during the intercheckpoint interval: this happens
with a probability that can be approximated by $1 - e^{-\lambda(T_{ex}+T_{ov})}$. This is actually
the probability of at least one failure over an interval of length $T_{ex} + T_{ov}$,
but if we assume that fault arrivals follow a Poisson process, then the
probability of experiencing $n$ failures over the interval $T_{ex} + T_{ov}$ drops very rapidly
with $n$ when $\lambda(T_{ex} + T_{ov}) \ll 1$ (as is usually the case). We therefore assume that
the probability of more than one failure between checkpoints is negligible. The
amount of additional time taken due to the failure is $\tau + T_r + T_{lt} - T_{ov}$; the
average value of $\tau$ is $(T_{ex} + T_{ov})/2$. Hence, the expected amount of additional time is
$(T_{ex} + T_{ov})/2 + T_r + T_{lt} - T_{ov}$. This time period is spent on top of the basic time
needed for execution and checkpointing, $T_{ex} + T_{ov}$, and thus the total expected
contribution of Case 2 is approximately

$$\left(1 - e^{-\lambda(T_{ex}+T_{ov})}\right)\left(T_{ex} + T_{ov} + \frac{T_{ex}+T_{ov}}{2} + T_r + T_{lt} - T_{ov}\right)
= \left(1 - e^{-\lambda(T_{ex}+T_{ov})}\right)\left(\frac{3T_{ex}}{2} + \frac{T_{ov}}{2} + T_r + T_{lt}\right)$$

Adding up the contributions of Cases 1 and 2, we obtain the expected length of
the intercheckpoint interval, $E_{int}$:

$$E_{int} \approx \frac{3T_{ex}}{2} + \frac{T_{ov}}{2} + T_r + T_{lt} - \left(\frac{T_{ex}}{2} + T_r + T_{lt} - \frac{T_{ov}}{2}\right)e^{-\lambda(T_{ex}+T_{ov})} \qquad (6.4)$$

Now, consider the sensitivity of $E_{int}$ to $T_{ov}$ and $T_{lt}$. From Equation 6.4, we have


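$$\frac{\partial E_{int}}{\partial T_{ov}} = \frac{1}{2} + \left[\frac{1}{2} + \lambda\left(\frac{T_{ex} - T_{ov}}{2} + T_r + T_{lt}\right)\right]e^{-\lambda(T_{ex}+T_{ov})} \qquad (6.5)$$

$$\frac{\partial E_{int}}{\partial T_{lt}} = 1 - e^{-\lambda(T_{ex}+T_{ov})} \qquad (6.6)$$

From these equations, we can see that

$$\frac{\partial E_{int}}{\partial T_{ov}} \gg \frac{\partial E_{int}}{\partial T_{lt}}$$

when $\lambda(T_{ex} + T_{ov}) \ll 1$.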

This confirms our intuition that the sensitivity of the expected intercheckpoint
interval to the overhead, $T_{ov}$, is much greater than its sensitivity to the latency, $T_{lt}$.
Therefore, we should do whatever we can to keep $T_{ov}$ low, even if we have to pay
for it by increasing $T_{lt}$ substantially.
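
A rough numerical illustration shows the size of this gap. The parameter values below are assumed purely for the example ($\lambda = 10^{-5}$ failures per second, $T_{ex} = 1000$ s, $T_{ov} = 10$ s, $T_{lt} = 60$ s, $T_r = 50$ s), and the two derivatives are those obtained by differentiating Equation 6.4.

```latex
\begin{align*}
\lambda (T_{ex} + T_{ov}) &= 10^{-5} \times 1010 = 0.0101,
  \qquad e^{-0.0101} \approx 0.98995 \\
\frac{\partial E_{int}}{\partial T_{lt}}
  &= 1 - e^{-\lambda(T_{ex}+T_{ov})} \approx 0.0100 \\
\frac{\partial E_{int}}{\partial T_{ov}}
  &= \frac{1}{2} + \left[\frac{1}{2}
     + \lambda\left(\frac{T_{ex}-T_{ov}}{2} + T_r + T_{lt}\right)\right]
     e^{-\lambda(T_{ex}+T_{ov})} \\
  &\approx 0.5 + (0.5 + 0.00605)(0.98995) \approx 1.001
\end{align*}
```

With these numbers, an extra second of overhead lengthens the expected interval by about one second, while an extra second of latency lengthens it by only about a hundredth of a second.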


6.3.2 Optimal Checkpoint Placement 

The above analysis focused on calculating the expected length of the
intercheckpoint interval, $E_{int}$, given a specific number, $N$, of equally spaced checkpoints, such
that we execute the program for $T_{ex} = T/(N + 1)$ time units between any two
consecutive checkpoints (where $T$ is the execution time of the program, not including
checkpointing and recovery from failures). One of the main problems of
checkpointing is the need to decide on the value of $T_{ex}$ or, in other words, determine
how many checkpoints to schedule during the execution of a long program.

The problem of determining the optimal number of checkpoints is known as the
checkpoint placement problem and its objective is to select $N$ (or equivalently, $T_{ex}$) so
as to minimize the expected total execution time of the program or equivalently,
to minimize the figure of merit

$$\eta = \frac{E_{int}}{T_{ex}} - 1$$

We show next how to determine the optimal value of $T_{ex}$ for the simple model
described above. Simplifying Equation 6.4 by using the first-order approximation

$$e^{-\lambda(T_{ex}+T_{ov})} \approx 1 - \lambda(T_{ex} + T_{ov})$$

we obtain

$$\eta = \frac{\frac{3T_{ex}}{2} + \frac{T_{ov}}{2} + T_r + T_{lt} - \left(\frac{T_{ex}}{2} + T_r + T_{lt} - \frac{T_{ov}}{2}\right)\left(1 - \lambda(T_{ex} + T_{ov})\right)}{T_{ex}} - 1$$

$$= \frac{(T_{ex} + T_{ov})\left[1 + \lambda\left(\frac{T_{ex}}{2} + T_r + T_{lt} - \frac{T_{ov}}{2}\right)\right]}{T_{ex}} - 1 \qquad (6.7)$$

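To select $T_{ex}$ so as to minimize $\eta$, we differentiate Equation 6.7 with respect to $T_{ex}$
and equate the derivative to zero, yielding

$$T_{ex}^{opt} = \sqrt{2T_{ov}\left(\frac{1}{\lambda} + T_r + T_{lt} - \frac{T_{ov}}{2}\right)} \qquad (6.8)$$

Based on the value of $T_{ex}^{opt}$, we can calculate the number of checkpoints that
minimizes $\eta$:

$$N_{opt} = \frac{T}{T_{ex}^{opt}} - 1$$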

Keep in mind that the above result is correct only for the simplified model with 
at most one failure during the intercheckpoint interval. We relax this assumption 
in the next section, where a more accurate model is presented. 
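
The differentiation step for the first-order model can be sketched as follows. The shorthand $C = T_r + T_{lt} - T_{ov}/2$ is introduced here only to keep the algebra short, and the numerical values at the end ($T_{ov} = 10$ s, $T_r = 50$ s, $T_{lt} = 60$ s, $\lambda = 10^{-5}$ per second, $T = 86{,}400$ s) are assumptions chosen for illustration.

```latex
\begin{align*}
\eta + 1 &= \frac{(T_{ex} + T_{ov})\left[1 + \lambda\left(\frac{T_{ex}}{2} + C\right)\right]}{T_{ex}}
          = (1 + \lambda C) + \frac{\lambda T_{ex}}{2}
            + \frac{T_{ov}(1 + \lambda C)}{T_{ex}} + \frac{\lambda T_{ov}}{2} \\
\frac{\partial \eta}{\partial T_{ex}} &= \frac{\lambda}{2} - \frac{T_{ov}(1 + \lambda C)}{T_{ex}^2} = 0
  \quad\Longrightarrow\quad
  T_{ex}^2 = 2 T_{ov}\left(\frac{1}{\lambda} + C\right) \\
T_{ex} &= \sqrt{2 \times 10 \times (100{,}000 + 50 + 60 - 5)}
        = \sqrt{2{,}002{,}100} \approx 1415\ \text{s} \\
N &= \frac{T}{T_{ex}} - 1 \approx \frac{86{,}400}{1415} - 1 \approx 60
\end{align*}
```

The second derivative, $2T_{ov}(1 + \lambda C)/T_{ex}^3$, is positive whenever $1 + \lambda C > 0$, so the stationary point is a minimum. Under the assumed values, a one-day job would take about 60 checkpoints, roughly one every 24 minutes.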

The alert reader may have been somewhat surprised by the appearance of $T_r$
in the above expression for $N_{opt}$. $T_r$ is the cost of recovering from a failure and,
intuitively, is not expected to affect the optimal number of checkpoints. Indeed,
$T_r$ disappears from the expression for $N_{opt}$ in the more exact model, as we will
see below. In the Exercises, we invite the reader to find an intuitive reason for the
presence of $T_r$ in the expression for $N_{opt}$ for the approximate model.

Note that we arrived at this result while deciding to place the checkpoints
uniformly along the time axis. Is uniform placement optimal? If the checkpointing
cost is the same, irrespective of when the checkpoint is taken, the answer is "yes." 
If the checkpoint size—and hence the checkpoint cost—varies greatly from one 
part of the execution to the other, the answer is often "no," and depends on the 
extent to which the checkpoint size varies. 


6.3.3 Time Between Checkpoints: A More Accurate Model

To relax the assumption that there is at most one failure in an intercheckpoint
interval, we go back to the conditioning on the time of the first failure but now
deal more accurately with Case 2. As before, Case 1, in which there are no failures
between successive checkpoints, contributes

$$(T_{ex} + T_{ov})\,e^{-\lambda(T_{ex}+T_{ov})}$$

to the average intercheckpoint time $E_{int}$.

In Case 2, suppose a failure occurred at time $\tau$ ($\tau < T_{ex} + T_{ov}$), an event that
has a probability of $\lambda e^{-\lambda\tau}\,d\tau$. The amount of time wasted due to the failure is
$\tau + T_r + T_{lt} - T_{ov}$, after which the computation will resume and will take an added
average time of $E_{int}$. The contribution of Case 2 is therefore

$$\int_{\tau=0}^{T_{ex}+T_{ov}} (\tau + T_r + T_{lt} - T_{ov} + E_{int})\,\lambda e^{-\lambda\tau}\,d\tau
= E_{int} + T_r + T_{lt} - T_{ov} + \frac{1}{\lambda} - \left(T_{ex} + T_r + T_{lt} + \frac{1}{\lambda} + E_{int}\right)e^{-\lambda(T_{ex}+T_{ov})}$$

Adding the two cases results in the following equation for $E_{int}$:

$$E_{int} = (T_{ex} + T_{ov})\,e^{-\lambda(T_{ex}+T_{ov})} + E_{int} + T_r + T_{lt} - T_{ov} + \frac{1}{\lambda} - \left(T_{ex} + T_r + T_{lt} + \frac{1}{\lambda} + E_{int}\right)e^{-\lambda(T_{ex}+T_{ov})}$$

whose solution is

$$E_{int} = \left(T_r + T_{lt} - T_{ov} + \frac{1}{\lambda}\right)\left(e^{\lambda(T_{ex}+T_{ov})} - 1\right) \qquad (6.9)$$

Since $T_{ov}$ appears in the exponent in Equation 6.9, $E_{int}$ is more sensitive to $T_{ov}$
than to $T_{lt}$.

Consider again the figure of merit,

$$\eta = \frac{E_{int}}{T_{ex}} - 1$$

that should be minimized to ensure that the normalized cost of checkpointing is
minimal.

Suppose we look for a $T_{ex}$ that minimizes $\eta$: this is obviously the value of $T_{ex}$
for which $\partial\eta/\partial T_{ex} = 0$ and $\partial^2\eta/\partial T_{ex}^2 > 0$. It is easy to show that the optimal value
of $T_{ex}$ is one that satisfies the equation

$$e^{\lambda(T_{ex}+T_{ov})} = \frac{1}{1 - \lambda T_{ex}} \qquad (6.10)$$


Thus, the optimal value, $T_{ex}^{opt}$, does not depend on the latency $T_{lt}$ or the recovery
time $T_r$, just the overhead, $T_{ov}$. Once the value of $T_{ex}^{opt}$ is known, we can calculate
the corresponding optimal number of checkpoints: $N_{opt} = T/T_{ex}^{opt} - 1$.
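
Equation 6.10 has no closed-form solution, but when $\lambda(T_{ex} + T_{ov}) \ll 1$ a leading-order estimate follows from a series expansion. The sketch below is an approximation added for illustration rather than part of the derivation above; the numerical values ($T_{ov} = 10$ s, $\lambda = 10^{-5}$ per second) are assumed.

```latex
\begin{align*}
e^{\lambda(T_{ex}+T_{ov})}\,(1 - \lambda T_{ex}) &= 1 \\
\left(1 + \lambda(T_{ex}+T_{ov}) + \tfrac{1}{2}\lambda^2 (T_{ex}+T_{ov})^2 + \cdots\right)(1 - \lambda T_{ex}) &= 1 \\
\lambda T_{ov} - \tfrac{1}{2}\lambda^2\left(T_{ex}^2 - T_{ov}^2\right) + O(\lambda^3) &= 0
  \quad\Longrightarrow\quad T_{ex}^{opt} \approx \sqrt{\frac{2 T_{ov}}{\lambda}} \\
T_{ex}^{opt} &\approx \sqrt{\frac{2 \times 10}{10^{-5}}} = \sqrt{2 \times 10^{6}} \approx 1414\ \text{s}
\end{align*}
```

Substituting $T_{ex} = 1414$ s back into Equation 6.10 gives approximately 1.0143 on both sides, so the estimate is accurate for these values. For larger values of $\lambda T_{ov}$ the equation should be solved numerically, for example by bisection.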

In sequential checkpointing, the application cannot be executed in parallel
with the checkpointing. We therefore have $T_{lt} = T_{ov}$, and the overhead ratio
becomes

$$\eta = \frac{\left(T_r + \frac{1}{\lambda}\right)\left(e^{\lambda(T_{ex}+T_{ov})} - 1\right)}{T_{ex}} - 1 \qquad (6.11)$$

which reduces to the expression in Equation 6.3 when $T_r = T_{ov} = 0$.


6.3.4 Reducing Overhead 

Buffering 

The most obvious way to reduce checkpointing overhead is to use a buffer. The 
system writes the checkpoint into a part of its main memory and then returns 
to executing the application. Direct memory access (DMA) is then used to copy 
the checkpoint from main memory to disk. DMA in most modern machines only 
requires CPU involvement at the beginning and at the end of the operation. 

A refinement of this approach is called copy-on-write buffering. The idea is that if
large portions of the process state have remained unchanged since the last
checkpoint, it is a waste of time to copy the unchanged pages to disk all over again.
Avoiding the recopying of the unaltered pages is facilitated by exploiting the
memory protection bits provided by most memory systems. Briefly, each page of the
physical main memory is provided with protection bits that can indicate whether
the page is read-write, read-only, or inaccessible. To implement copy-on-write
buffering, the protection bits of the pages belonging to the process are all set to
read-only when the checkpoint is taken. The application continues running while
the checkpointed pages are transferred to disk. Should the application attempt to
update a page, an access violation is triggered. The system is then supposed to
respond by buffering the appropriate page, following which the permission on that
page can be set to read-write. The buffered page is, in due course, copied to disk.
(Clearly, the user-specified status of a page has to be saved elsewhere, to prevent
a read-only or inaccessible page being written into.)

The advantage of copy-on-write over simple buffering is that if the process does
not update the main memory pages too often, most of the work involved in
copying the pages to a buffer area can be avoided. This is an example of incremental
checkpointing, which consists of simply recording the changes in the process state 
since the previous checkpoint was taken. If these changes are few, the size of the 
incremental checkpoints will be quite small, and much less will have to be saved 
per checkpoint. 

The obvious drawback of incremental checkpointing is that the process of
recovery is more complicated. It is no longer a matter of simply loading the latest
checkpoint and resuming computation from there; one has to build the system 
state by examining a succession of incremental checkpoints. 
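
The trade-off between small incremental checkpoints and a more involved recovery can be sketched with a toy page store. All names and sizes below are hypothetical, a software dirty flag stands in for the hardware protection bits, and the initial process state is assumed to be all zeros.

```c
#include <string.h>

#define NPAGES    8      /* illustrative sizes, not taken from the text */
#define PAGE_SIZE 16
#define MAX_CKPTS 4

static unsigned char memory[NPAGES][PAGE_SIZE];  /* process state */
static int dirty[NPAGES];             /* written since last checkpoint? */

/* One incremental checkpoint: only the pages that were dirty. */
struct increment {
    int npages;
    int page_no[NPAGES];
    unsigned char data[NPAGES][PAGE_SIZE];
};
static struct increment ckpt_log[MAX_CKPTS];
static int nckpts;

void write_byte(int page, int offset, unsigned char value)
{
    memory[page][offset] = value;
    dirty[page] = 1;
}

/* Save the dirty pages only. Returns the number of pages saved,
   or -1 if the log is full. */
int take_incremental_checkpoint(void)
{
    if (nckpts == MAX_CKPTS) return -1;
    struct increment *inc = &ckpt_log[nckpts++];
    inc->npages = 0;
    for (int p = 0; p < NPAGES; p++) {
        if (!dirty[p]) continue;
        inc->page_no[inc->npages] = p;
        memcpy(inc->data[inc->npages], memory[p], PAGE_SIZE);
        inc->npages++;
        dirty[p] = 0;
    }
    return inc->npages;
}

/* Recovery: rebuild the state by replaying every increment in order. */
void recover(void)
{
    memset(memory, 0, sizeof memory);   /* assumed initial state */
    for (int c = 0; c < nckpts; c++)
        for (int k = 0; k < ckpt_log[c].npages; k++)
            memcpy(memory[ckpt_log[c].page_no[k]],
                   ckpt_log[c].data[k], PAGE_SIZE);
    memset(dirty, 0, sizeof dirty);
}
```

Each checkpoint stores only the pages touched since the previous one, but `recover` has to walk the whole log, which is the added recovery cost described above.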

Memory Exclusion 

Another approach to lowering the checkpointing overhead attempts to reduce the
amount of information that must be stored upon a checkpoint. There are two types
of variables that are unnecessary to record in a checkpoint: those that have not
been updated since the last checkpoint, and those that are "dead." A dead variable
is one whose present value will never again be used by the program. There are two
kinds of dead variables: those that will never again be referenced by the program,
and those for which the next access will be a write. The challenge is to accurately
identify such variables.

The address space of a process has four segments: code, global data, heap, and
stack. Finding some dead variables in the code and stack is not difficult. Because
self-modifying code is no longer used, we can regard the code segment in memory
as read-only, which need not be checkpointed. The stack segment is equally
simple: the contents of addresses held in locations below the stack pointer are
obviously dead. (The virtual address space usually has the stack segment at the top,
growing downward: locations below the stack pointer represent memory not
currently being used by the stack.) As far as the heap segment is concerned, many
languages allow the programmer to explicitly allocate and deallocate memory (e.g.,
the malloc() and free() calls in C). The contents of the free list are dead by
definition. Finally, some user-level checkpointing packages (e.g., libckpt) provide the
programmer with procedure calls (such as checkpoint_here()) that specify regions of
the memory that should be excluded from, or included in, future checkpoints.


6.3.5 Reducing Latency 


Checkpoint compression has been suggested as one way to reduce latency. The
smaller the checkpoint, the less that has to be written onto disk. How much, if
anything, is gained through compression depends on

■ The extent of the compression. This is application dependent: in some cases,
the compression reduces checkpoint size by over 50%; in others, it barely
makes a difference.

■ The work required to execute the compression algorithm. This usually has
to be done by the CPU and thus contributes to the checkpointing overhead.

In simple sequential checkpointing, where the CPU does not execute until the
checkpoint has been committed to disk, compression is beneficial whenever the
reduction in disk write time more than compensates for the execution time of the
compression algorithm. In more efficient systems, where $T_{ov} < T_{lt}$, the usefulness
of this approach is questionable and must be carefully assessed before being used.
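
For the sequential case, the break-even condition can be written as a one-line comparison. The function below is a sketch under simplifying assumptions: constant disk and compressor throughputs, no overlap between compression and writing, and a parameterization (`ratio`, `disk_bw`, `compress_bw`) chosen here for illustration.

```c
/* Returns 1 if compressing before writing shortens a sequential checkpoint.
   All parameters are illustrative assumptions:
     size_bytes   - uncompressed checkpoint size
     ratio        - compressed size / original size (0 < ratio <= 1)
     disk_bw      - disk write bandwidth, in bytes per second
     compress_bw  - compressor throughput, in input bytes per second */
int compression_pays_off(double size_bytes, double ratio,
                         double disk_bw, double compress_bw)
{
    double plain      = size_bytes / disk_bw;
    double compressed = size_bytes / compress_bw
                      + ratio * size_bytes / disk_bw;
    return compressed < plain;
}
```

The checkpoint size cancels out of the comparison: under these assumptions compression pays off exactly when `1/compress_bw < (1 - ratio)/disk_bw`, that is, when the compressor is fast relative to the disk and the achieved reduction is large enough.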

Another way of reducing latency is the incremental checkpointing technique
mentioned earlier.

6.4 Cache-Aided Rollback Error Recovery (CARER)

Reducing checkpointing overhead allows us to increase the checkpointing
frequency, thereby reducing the penalty of a rollback upon failure. The Cache-Aided
Rollback Error Recovery (CARER) approach is a scheme that seeks to reduce the
time required to take a checkpoint by marking the process footprint in main
memory and the cache as parts of the checkpointed state. This, of course, assumes that
the memory and cache are far less prone to failure than is the processor itself,
and are therefore reliable enough to store checkpoints. If not, the probability of
the checkpoint itself being corrupted will be unacceptably high and the CARER
approach cannot be used.

The checkpoint consists of the process's footprint in main memory, together
with any lines of the cache which may be marked as being part of the checkpoint. 
This approach requires a hardware modification to be made to the system, in the 
form of an extra checkpoint bit associated with each cache line. When this bit is 1, 
it indicates the corresponding line is unmodifiable, which means that the line is part 
of the latest checkpoint, and so the processor may not update any word in that line 
without being forced to take a checkpoint immediately after that update. If the bit 
is 0, the processor is free to modify the word. 

Because all of the process footprint in the main memory and the marked lines in 
the cache do double duty as both memory and part of the checkpoint, we have less 
freedom in deciding when checkpoints have to be taken. The general rule is that 
a checkpoint is forced whenever the system needs to update anything in a cache 
line whose checkpoint bit is 1, or in the main memory. If a checkpoint is not taken 
at such a time, then, upon a fault occurring afterward, the system will roll back to
the old values of the processor registers, but to modified contents of the memory
and/or cache. The above implies that checkpoints are also forced when an external
interrupt occurs or an I/O instruction is executed (since either could update the 
memory). To summarize, we are forced to take a checkpoint every time one of the 
following happens: 

■ A cache line marked unmodifiable is to be updated. 

■ The main memory is to be updated. 

■ An I/O instruction is executed or an external interrupt occurs. 

Taking a checkpoint involves (a) saving the processor registers in memory, and 
(b) setting to 1 the checkpoint bit associated with each valid cache line. By
definition, therefore, a line in the cache whose checkpoint bit is 1 was last modified
before the latest checkpoint was taken. 

As a result, the checkpoint consists of the footprint of the process in the main 
memory, together with all the cache lines that are marked unmodifiable and the 
register copies. Rolling back to the previous checkpoint is now very simple: just
restore the registers from their copies in memory and mark as invalid all the lines
in the cache whose checkpoint bit is 0. 

This approach is not without its costs. The hardware of the cache has to be 
modified to introduce the checkpoint bit, and every write-back of any cache line 
into main memory involves taking a checkpoint. 
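The checkpoint-bit mechanism can be sketched as a toy model. This is a minimal illustration under stated assumptions, not the actual CARER hardware: the class and method names are hypothetical, each cache line holds a single value, and main memory is a dictionary that is never written between checkpoints.

```python
class CarerCache:
    """Toy model of a cache with one checkpoint bit per line."""

    def __init__(self, memory):
        self.memory = dict(memory)   # main memory, part of the checkpoint
        self.lines = {}              # address -> [value, checkpoint_bit]
        self.registers = {}          # live processor registers
        self.saved_registers = {}    # register copy kept in memory
        self.checkpoints_taken = 0

    def _line(self, address):
        # On a miss, bring the line in from memory with checkpoint bit 0.
        return self.lines.setdefault(address, [self.memory.get(address), 0])

    def take_checkpoint(self):
        # (a) save the registers, (b) set the bit on every valid line.
        self.saved_registers = dict(self.registers)
        for line in self.lines.values():
            line[1] = 1
        self.checkpoints_taken += 1

    def read(self, address):
        return self._line(address)[0]

    def write(self, address, value):
        line = self._line(address)
        forced = line[1] == 1        # line belongs to the latest checkpoint
        line[0] = value
        if forced:
            # Updating an unmodifiable line forces a checkpoint right away.
            self.take_checkpoint()

    def rollback(self):
        # Restore registers; invalidate every line whose checkpoint bit is 0.
        self.registers = dict(self.saved_registers)
        self.lines = {a: l for a, l in self.lines.items() if l[1] == 1}
```

After a rollback, lines written since the last checkpoint are gone from the cache, and a later read fetches the old value from memory again.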

6.5 Checkpointing in Distributed Systems

A distributed system consists of a set of processors and their associated memories, 
connected by means of an interconnection network (see Chapter 4). Each
processor usually has local disks, and there can also be a network file system equally
accessible to all the processors. 

Logically, we will consider a distributed system to consist of a number of 
processes connected together by means of directional channels. Channels can be 
thought of as point-to-point connections from one process to another. Unless
otherwise specified, we will assume that each channel is error-free and delivers all
messages in the order in which it received them. 

We start by providing some details about the system model underlying the 
analysis that follows. 

The state of a process has the obvious meaning; the state of a channel at any
time t is the set of messages carried by this channel up to time t (together with
the order in which they were received). The state of the distributed system is the 
aggregate of the states of the individual processes and of the channels. 

The state of a distributed system is said to be consistent if, for every message 
delivery recorded in the state, there is a corresponding message-sending event. 
A state that violated this constraint would, in effect, be saying that we have a 
message delivered that had not yet been sent. This violates causality, and such 
a message is called an orphan. Note that the converse need not be the case; it is 
perfectly consistent to have the system state reflect the sending of a message but 
not its receipt. 

Figure 6.4 provides an illustration. Here, we have two processes, P and Q, each
of which has two checkpoints (CP1, CP2 and CQ1, CQ2, respectively), taken over
the duration shown here. Message m is sent by P to Q.

FIGURE 6.4 Consistent and inconsistent states.

The following sets of checkpoints represent a consistent system state:

■ {CP1, CQ1}: Neither checkpoint has any information about m.

■ {CP2, CQ1}: CP2 records that m was sent; CQ1 has no record of receiving m.

■ {CP2, CQ2}: CP2 records that m was sent; CQ2 records that it was received.

In contrast, the set {CP1, CQ2} does not represent a consistent system state. CP1 has
no record of m being sent, whereas CQ2 records that m was received; m is therefore
an orphan message in this set of checkpoints.
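The consistency test amounts to checking that no checkpoint records a receipt whose sending is unrecorded. The sketch below assumes a hypothetical representation in which each checkpoint is a dictionary holding the sets of message identifiers it records as sent and as received.

```python
def orphans(checkpoints):
    """Return the messages recorded as received but not recorded as sent.

    checkpoints -- dict: process name -> {"sent": set, "received": set}
    """
    all_sent = set().union(*(cp["sent"] for cp in checkpoints.values()))
    all_received = set().union(*(cp["received"] for cp in checkpoints.values()))
    return all_received - all_sent


def is_consistent(checkpoints):
    """A set of checkpoints is consistent if it contains no orphan message."""
    return not orphans(checkpoints)


# The four checkpoints of the two-process example with a single message m.
CP1 = {"sent": set(), "received": set()}
CP2 = {"sent": {"m"}, "received": set()}
CQ1 = {"sent": set(), "received": set()}
CQ2 = {"sent": set(), "received": {"m"}}
```

With these definitions, `is_consistent({"P": CP2, "Q": CQ1})` holds (a sent but unreceived message is allowed), while `{"P": CP1, "Q": CQ2}` yields the orphan `m`.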

A set of checkpoints that represents a consistent system state is said to form a 
recovery line. We can roll the system back to any available recovery line and restart 
from there: 

■ {CP1, CQ1}: Rolling back P to CP1 undoes the sending of m, and rolling back
Q to CQ1 means that Q does not have any record of having received m.
Thus, restarting from these checkpoints, P will again send out m, which Q
will receive in due course.

■ {CP2, CQ1}: Rolling back P to CP2 means that it will not retransmit m;
however, rolling back Q to CQ1 means that now Q has no record of ever having
received m. In this case, the system managing the recovery has to be able
to play back m to Q. This can be done by using the checkpoint of P or by
having a separate message log, recording everything received by Q. We will
discuss message logs later.

■ {CP2, CQ2}: The checkpoints record the sending, and receipt, of m.

Sometimes, checkpoints may be placed in such a way that they will never form
part of a recovery line. Figure 6.5 provides such an example. CQ2 records the
receipt of m1, but not the sending of m2. {CP1, CQ2} cannot be consistent (since
otherwise m1 would become an orphan); similarly {CP2, CQ2} cannot be consistent
(since otherwise m2 would become an orphan).

FIGURE 6.5 CQ2 is a useless checkpoint.

6.5.1 The Domino Effect and Livelock

If we do not coordinate checkpoints either directly (through message passing) or 
indirectly (by using synchronized clocks), a single failure could cause a sequence 
of rollbacks that send every process back to its starting point. This is called the 
domino effect. 

In Figure 6.6, we have a distributed system consisting of two processors, P
and Q, sending messages to each other. The checkpoints are positioned as shown.
When P suffers a transient failure, it rolls back to checkpoint CP3. However,
because it sent out a message, m6, after CP3 was taken, Q has to roll back to before it
received this message (otherwise Q would have recorded a message that was
officially never sent: an orphan message). Consequently, Q must roll back to CQ2. But
this will trigger a rollback of P to CP2 because Q sent a message, m5, to P, and P
has to move back to a state in which it never received this message. This continues
until all of the processes have rolled back to their starting positions. This sequence
of rollbacks is an example of the domino effect.
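The cascade can be reproduced with a small fixed-point computation. The sketch below is an illustration under assumptions: the function name is hypothetical, checkpoints and message events carry made-up timestamps rather than the positions in the figure, and every process is assumed to have a checkpoint at time 0 (its starting point).

```python
import math


def propagate_rollback(checkpoints, messages, failed, failure_time):
    """Return the time each process ends up rolled back to.

    checkpoints -- dict: process -> list of checkpoint times (0 = start)
    messages    -- list of (sender, send_time, receiver, receive_time)
    A value of math.inf means the process did not have to roll back.
    """
    def latest_before(process, t):
        return max(c for c in checkpoints[process] if c < t)

    state = {p: math.inf for p in checkpoints}
    state[failed] = latest_before(failed, failure_time)
    changed = True
    while changed:
        changed = False
        for sender, sent_at, receiver, received_at in messages:
            # Orphan: the send was undone but the receipt is still recorded.
            if sent_at > state[sender] and received_at <= state[receiver]:
                state[receiver] = latest_before(receiver, received_at)
                changed = True
    return state


# Hypothetical schedule in which messages straddle every checkpoint interval.
example_checkpoints = {"P": [0, 4, 8], "Q": [0, 2, 6]}
example_messages = [
    ("P", 1, "Q", 1.5),
    ("Q", 3, "P", 3.5),
    ("P", 5, "Q", 5.5),
    ("Q", 7, "P", 7.5),
    ("P", 9, "Q", 9.5),
]
```

With this schedule, a failure of P at time 10 drives both processes all the way back to time 0, which is the domino effect in miniature.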

It is the interaction between the processes in the form of messages being passed
between them that gives rise to the domino effect. The problem arises when we
insist on the checkpoints forming a consistent distributed state, at which no orphan
messages exist. There is a somewhat weaker problem that arises when messages
are lost due to rollback, illustrated in Figure 6.7. Suppose Q rolls back to CQ1 after
receiving message m from P. When it does so (unless inter-processor messages are
stored somewhere safe), all activity associated with having received that message
is lost. If P does not roll back to CP2, then the situation is as if P had sent a
message which was never received by Q. This is not as severe a problem as orphan
messages because lost messages do not violate causality. They can be treated as
any messages that may be lost due to network problems, for example, by
retransmission. Note, however, that if Q had sent an acknowledgment of that message
to P before rolling back, then that acknowledgment would be an orphan message
unless P rolls back to CP2.

FIGURE 6.6 Example of the domino effect.

FIGURE 6.7 Example of a lost message.

FIGURE 6.8 Example of livelock.

FIGURE 6.9 P taking CP3 forces Q to checkpoint.

There is another problem that can arise in distributed checkpointed systems:
that of livelock. Consider the situation shown in Figure 6.8. Q sends P a message
m1, and P sends Q message m2. Then, P fails at the point shown, before receiving m1.
To prevent m2 from being orphaned, Q must roll back to CQ1. In the meantime, P
recovers, rolls back to CP2, sends another copy of m2, and then receives the copy
of m1 that was sent before all the rollbacks began. However, because Q has rolled
back, this copy of m1 is now orphaned, and so P has to repeat its rollback. This, in
turn, orphans the second copy of m2 as well, forcing Q to also repeat its rollback.
This dance of the rollbacks may continue indefinitely unless there is some outside
intervention.

6.5.2 A Coordinated Checkpointing Algorithm 

We have seen that if checkpointing is uncoordinated, distributed systems can
suffer the domino effect or livelock. In this section, we outline one approach to
checkpoint coordination.

Consider Figure 6.9 and suppose that P wants to establish a checkpoint at CP3.
This checkpoint will record, among other things, that message m was received
from Q. As a result, to prevent this message from ever being orphaned, Q must
checkpoint as well. That is, if we want to prevent m from ever becoming an
orphan message, the fact that P establishes a checkpoint at CP3 forces Q to take a
checkpoint to record the fact that m was sent.

Let us now describe an algorithm that carries out such coordinated
checkpointing. There are two types of checkpoints in this algorithm, tentative and permanent.
When a process P wants to take a checkpoint, it records its current state in a
tentative checkpoint. P then sends a message to all other processes from which it
has received a message since taking its last checkpoint. Call this set P̂. This message
tells each process Q in P̂ the last message, m_qp, that P received from it before the
tentative checkpoint was taken. If the sending of message m_qp has not been recorded
in a checkpoint by Q, then to prevent m_qp from being orphaned, Q will be asked to
take a tentative checkpoint recording the sending of m_qp. If all the processes in
P̂ that need to do so confirm taking a checkpoint as requested, then all the tentative
checkpoints can be converted to permanent checkpoints. If, for some reason, one
or more members of P̂ are not able to checkpoint as requested, P and all other
members of P̂ abandon their tentative checkpoints, instead of making them
permanent.

Note that this process can set off a chain reaction of checkpoints. If P initiates a
round of checkpointing among processes in P̂, each member of P̂ can itself
potentially spawn a set of checkpoints among processes within its corresponding set.
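The tentative/permanent round, including the chain reaction, can be sketched as follows. This is a simplified illustration, not the algorithm as specified in any particular system: the class and function names are hypothetical, message exchange is replaced by direct calls, and every process reached is assumed to need a checkpoint (that is, the send in question has not yet been recorded).

```python
class Process:
    """Minimal per-process state for one round of coordinated checkpointing."""

    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint
        self.received_from = set()   # senders heard from since last checkpoint
        self.tentative = False
        self.permanent_count = 0


def coordinated_checkpoint(initiator, processes):
    """Run one round; return True if the tentative checkpoints were committed."""
    involved, frontier, seen = [], [initiator], set()
    success = True
    while frontier:
        name = frontier.pop()
        if name in seen:
            continue
        seen.add(name)
        process = processes[name]
        if not process.can_checkpoint:
            success = False          # one refusal aborts the whole round
            break
        process.tentative = True
        involved.append(process)
        # Chain reaction: each participant involves its own set of senders.
        frontier.extend(process.received_from)
    for process in involved:
        if success:
            process.permanent_count += 1
            process.received_from.clear()
        process.tentative = False    # committed or abandoned
    return success
```

If P has heard from Q and Q has heard from R, a round started by P reaches all three; a single process unable to checkpoint causes every tentative checkpoint in the round to be abandoned.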

6.5.3 Time-Based Synchronization 

Orphan messages cannot happen if each process checkpoints at exactly the same 
global time. However, this is practically impossible because clock skews and
message communication times cannot be reduced to zero. Time-based synchronization
can still be used to facilitate checkpointing: we just have to take account of nonzero 
clock skews in doing so. 

In time-based synchronization, we checkpoint the processes at previously
agreed times. For example, we may ask each process to checkpoint when its
local clock reads a multiple of 100 seconds. By itself, such a procedure is not enough
to avoid orphan messages (see Figure 6.10). Here, each process is checkpointing at
time 1100 (where time is read off the local clock). Unfortunately, the skew between
the two clocks is such that process P0 checkpoints much earlier (in real time) than
does process P1. As a result, P0 sends out a message to P1 after its checkpoint,
which is received by P1 before its checkpoint. This message is a potential orphan.

FIGURE 6.10 Creation of an orphan message in time-based synchronization.

If clock skews can be bounded, it is easy to prevent such orphan messages from
being generated. Suppose the maximum skew between any two clocks in the
distributed system is δ, and each process is asked to checkpoint when its local clock
reads τ. Following this checkpoint, a process P0 should not send out messages to
any process P1 until it is certain that P1's local clock reads later than τ. Because the
skews are upper-bounded by δ, this means that P0 should remain silent over the
duration [τ, τ + δ] (as measured by P0's local clock).

We can shorten this interval of silence if there is a lower bound on the
interprocess message delivery time. If this time is ε, then it is clearly enough for process
P0 to remain silent over the duration [τ, τ + δ − ε] to prevent the formation of
orphan messages. (If ε > δ, this interval is of zero length, and there is no need for
such an interval of silence.)

Yet another variation is for a process that receives a message to not include it in
its checkpoint and not act upon it if the message could possibly become an orphan.
Suppose message m is received by process P1 when its clock reads t. Message m
must have been sent (by, say, process P0) no later than ε units earlier, that is, before
P1's clock read t − ε. Because the clock skew is upper-bounded by δ, at this time, P0's
clock should have read at most t − ε + δ. If t − ε + δ < τ, then the sending of
this message would have been recorded in P0's checkpoint, and as a result, the
message cannot be an orphan. Hence, if message m is received by P1 when its
clock reads earlier than τ − δ + ε, it cannot be an orphan. Thus, another way to avoid
orphan messages is for a receiving process not to act upon any message received
in a window of time [τ − δ + ε, τ] (neither use it nor include it in its checkpoint at
time τ) until after taking its own checkpoint at time τ (time as told by the receiving
process's local clock).
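Both rules reduce to simple interval arithmetic on the local clock. In the sketch below the function names are hypothetical; `tau` is the agreed checkpoint time, `delta` the bound on clock skew, and `epsilon` the lower bound on message delivery time, all in the same time unit.

```python
def silence_interval(tau, delta, epsilon=0):
    """Sender-side rule: stay silent over [tau, tau + delta - epsilon].

    Returns (start, end) on the sender's local clock; the interval has
    zero length when epsilon is at least delta.
    """
    end = max(tau, tau + delta - epsilon)
    return (tau, end)


def must_defer(receive_time, tau, delta, epsilon):
    """Receiver-side rule: True if a message received at receive_time
    (local clock) falls in the window [tau - delta + epsilon, tau] and
    must therefore be held until after the checkpoint at tau."""
    return tau - delta + epsilon <= receive_time <= tau
```

For example, with a checkpoint at local time 1100, a skew bound of 5, and a minimum delivery time of 2 (all hypothetical figures), the sender stays silent until 1103 and the receiver holds back anything that arrives between 1097 and 1100.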

6.5.4 Diskless Checkpointing 

Main memory is volatile and is, by itself, often unsuitable as a medium in which
to store a checkpoint. However, with extra processors, we can borrow some
techniques from RAID (see Section 3.2) to permit checkpointing in main memory. By
avoiding disk writes, checkpointing can be made much faster. Diskless
checkpointing is probably best used as one level in a two-level checkpointing scheme,
which is mentioned in the Further Reading section.

Diskless checkpointing is implemented by having redundant processors using
RAID-like techniques to deal with failure. For example, suppose we have a
distributed system consisting of six executing processors and one extra processor. Each
executing processor stores its checkpoint in its own memory; the extra processor stores
in its memory the parity of these checkpoints. Thus, if any one of the executing
processors were to fail, its checkpoint can be reconstructed from the remaining
five checkpoints plus the parity checkpoint.
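The parity scheme can be illustrated with bytewise XOR. The following is a minimal sketch that assumes all checkpoints are byte strings of equal length; the function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(checkpoints):
    """Parity checkpoint held by the extra processor."""
    return reduce(xor_bytes, checkpoints)


def reconstruct(surviving, parity):
    """Rebuild the single lost checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity)
```

XOR-ing the five surviving checkpoints into the parity cancels their contributions and leaves exactly the lost one.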

We can similarly use other levels of RAID as analogs. For example, RAID level 
1 involves disk mirroring. By analogy, we can mirror the checkpoints; in other 
words, hold in two separate main memory modules identical copies of each
checkpoint. Such a system can obviously withstand up to one failure.

In such systems, the interprocessor network must have enough bandwidth to 
cope with the sending of checkpoints. Also, hotspots can develop that will slow 
down the whole system. For example, suppose we have several executing and one 
checkpointing processors. If all the executing processors send their checkpoints 
to the checkpointing processor to have the parity calculated, the result will be a 
potentially debilitating hotspot. We can alleviate the problem by distributing the 
parity computations as shown in Figure 6.11. 

6.5.5 Message Logging 

Recovery consists of rolling back to the latest checkpoint and taking up the
computation from that point. In a distributed system, however, to continue the 
computation beyond the latest checkpoint, the recovering process may require all 
the messages it received since that checkpoint, played back in the same order as 
it originally got them. If coordinated checkpointing is used, each process can be 
rolled back to its latest checkpoint and restarted: those messages will automatically 
be resent during the re-execution. However, if we want to avoid the overhead of
coordination and decide to let processes checkpoint independently of one another, 
logging messages into stable storage is an option. 

We will consider two approaches to message logging: pessimistic and
optimistic. Pessimistic message logging ensures that rollback will not spread to other
processes; if a process fails, no other process will need to be rolled back to ensure
consistency. In contrast, in optimistic logging, we may have a situation in which a 
process failure can trigger the rollback of other processes as well. 

Throughout this section, we will assume that to recover a process, it is sufficient 
to roll it back to some checkpoint and then replay to it the messages it received 
since that point, in the order in which they were originally received. 

Pessimistic Message Logging 

Several pessimistic message logging algorithms exist. Perhaps the simplest is for 
the receiver of a message to stop whatever it is doing when it receives a
message, log the message onto stable storage, and then resume execution. Recovering
a process from failure is extremely simple: just roll it back to its latest checkpoint 
and play back to it the messages it received since that checkpoint, in the right 
order. No orphan messages will exist in the sense that every message will have 
been either received before the latest checkpoint or explicitly saved in the message 
log. As a result, rolling back one process will not trigger the rollback of any other 
process. 

The requirement that a process must log messages into its stable storage (as
opposed to a volatile storage) can impose a significant overhead. If we are designing
the system to be able to withstand at most one isolated failure at any one time, then 
the above-mentioned basic algorithm is overkill, and sender-based message logging 
can be used instead. 

As its name implies, in sender-based message logging the sender of a message
records it in a log. To save time, this log is stored initially in a high-speed buffer;
when required, the log can be read to replay the message. This scheme is
implemented as follows. Each process has a send-counter and a receive-counter, which
are incremented every time the process sends or receives a message, respectively.
Each message has a Send Sequence Number (SSN), which is the value of the
send-counter at the node when it is transmitted. When a process receives a message,
it allocates it a Receive Sequence Number (RSN), which is the value of the
receive-counter (at the receiver end) when it was received. The receiver also sends
out an acknowledgment to the sender, including the RSN it has allocated to the
message. Upon receiving this acknowledgment, the sender acknowledges the
acknowledgment in a message to the receiver. Between the time that the receiver
receives the message and sends its acknowledgment, and when it receives the
sender's acknowledgment of its own acknowledgment, the receiver is forbidden to
send any messages to any other processes. This, as we shall see, is essential to
maintaining correct functioning upon recovery.

A message is said to be fully logged when the sending node knows both its SSN 
and its RSN; it is partially logged when the sending node does not yet know its RSN. 

When a process rolls back and restarts computation from the latest checkpoint, 
it sends out to the other processes a message listing the SSN of their latest message
that it recorded in its checkpoint. When this message is received by a process, it
knows which messages are to be retransmitted, and does so. 

The recovering process now has to use these messages in the same order as
they were used before it failed. This is easy to do for fully logged messages,
because their RSNs are available, and they can be sorted by this number. The only
remaining problem is the partially logged messages, whose RSNs are not
available. Partially logged messages are those that were sent out, but whose
acknowledgment was never received by the sender. This could be either because the
receiver failed before the message could be delivered to it or because it failed after
receiving the message but before it could send out the acknowledgment.
However, recall that the receiver is forbidden to send out messages of its own to other
processes between receiving the message and sending out its acknowledgment.
As a result, receiving the partially logged messages in a different order the second
time cannot affect any other process in the system, and correctness is preserved.
This approach is only guaranteed to work if there is at most one failed node at any
time.
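The replay order used during recovery can be sketched as follows. The message representation (a dictionary with `sender`, `ssn`, and `rsn` keys, where `rsn` is `None` for a partially logged message) and the function name are assumptions made for illustration; placing the partially logged messages last, in arrival order, is one arbitrary choice among the orders the argument above permits.

```python
def replay_order(retransmitted):
    """Order retransmitted messages for replay at the recovering process.

    Fully logged messages (known RSN) are replayed in RSN order.
    Partially logged messages (rsn is None) follow in arrival order.
    """
    fully_logged = [m for m in retransmitted if m["rsn"] is not None]
    partially_logged = [m for m in retransmitted if m["rsn"] is None]
    return sorted(fully_logged, key=lambda m: m["rsn"]) + partially_logged
```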

Optimistic Message Logging 

Optimistic message logging has a lower overhead than pessimistic logging;
however, recovery from failure is much more complex. At the moment, optimistic
logging is probably of little more than theoretical interest, and so we only
provide here a brief outline of the technique.

When messages are received, they are written into a high-speed volatile buffer. 
Then, at a suitable time, the buffer is copied into stable storage. Process
execution is not disrupted, and so the logging overhead is very low. The problem is
that upon failure, the contents of the buffer can be lost. This can lead to multiple 
processes having to be rolled back. For this method to work we need a scheme to 
compute the recovery line. See the Further Reading section for a pointer to such a 
scheme. 

Staggered Checkpointing 

Many checkpointing algorithms can result in a large number of processes taking 
checkpoints at nearly the same time. If they are all writing to a shared stable
storage, such as a set of disks equally available to all processes through a network,
this surge can lead to congestion at the disks or network or both. To avoid this 
problem, we can take one of the following two approaches. 

The first is to write the checkpoint into a local buffer and then stagger the writes
of this buffer into stable storage. This assumes that we have a buffer of sufficiently 
large capacity. 

The second approach is to try staggering the checkpoints in time. Staggering 
can be done as follows. Ensure that, at any time, at most one process is taking 
its checkpoint. These checkpoints may not be consistent, meaning that there may 
well be orphan messages in the system. To avoid this, have a coordinating phase
in which each process logs in stable storage all messages it sent out since its
previous checkpoint. The message-logging phase of the processes will overlap in time;
however, if the volume of messages sent is smaller than the size of the individual 
checkpoints, the disk system and the network will see a much reduced surge. 

If a process fails, it can be restarted after rolling it back to its last checkpoint. 
All the messages that are stored in the message log can be played back to it. As a 
result, the process can be recovered up to the point just before r, the time when 
it first received a message that was not logged. It is as if a checkpoint was taken 
just prior to r; we call this combination of checkpoint and message log a logical 
checkpoint. The staggered checkpointing algorithm guarantees that all the logical 
checkpoints form a consistent recovery line. 

Let us now state in a more precise manner the algorithm for a distributed
system consisting of the n processors P0, P1, ..., P(n−1). The algorithm consists of two
phases: a checkpointing and a message-logging phase. The first phase is as follows:

/* Checkpointing Phase */
for (i = 0; i <= n - 1; i++) {
    Pi takes a checkpoint.
    Pi sends a message to P((i+1) mod n), ordering the
    latter to take a checkpoint.
}

The second phase begins at the end of the above loop when P0 gets a message
from P(n−1) ordering P0 to take a checkpoint: this is the cue for P0 not to take another
checkpoint but to initiate the second phase. It does this by sending out a marker
message on each of its outgoing channels. When a process Pi receives a marker
message, it does the following:

/* Message Logging Phase */
if (no previous marker message was received in this round by Pi)
then {
    Pi sends a marker message on each of its outgoing channels.
    Pi logs all messages received by it after the preceding
    checkpoint and before the marker was received.
}
else
    Pi updates its message log by adding all the messages received by
    it since the last message log and before the marker was received.
end if.
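The two phases can be sketched in runnable form. This is an illustrative model only: the names are hypothetical, message passing is replaced by direct method calls, and a Python list stands in for stable storage.

```python
def checkpointing_phase(n):
    """List the phase-one events for processes P0..P(n-1) in ring order."""
    events = []
    for i in range(n):
        events.append(("checkpoint", i))
        # The final order returns to P0 and is its cue to start phase two.
        events.append(("order", i, (i + 1) % n))
    return events


class MarkerHandler:
    """Per-process bookkeeping for the message-logging phase."""

    def __init__(self):
        self.seen_marker = False
        self.unlogged = []   # application messages received, not yet logged
        self.log = []        # stand-in for stable storage

    def receive_application_message(self, message):
        self.unlogged.append(message)

    def receive_marker(self):
        """Log pending messages; return True if markers must be forwarded."""
        first_marker = not self.seen_marker
        self.seen_marker = True
        self.log.extend(self.unlogged)
        self.unlogged.clear()
        return first_marker
```

Only the first marker in a round causes a process to forward markers on its outgoing channels; every marker, first or not, flushes the messages received since the previous flush into the log.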

Consider the system shown in Figure 6.12a. It consists of three processes, P0,
P1, and P2, each of which can communicate with the others. Process P0 acts as
the checkpointing coordinator; it starts the first phase of the algorithm by taking
a checkpoint and sending out a take_checkpoint order to P1. P1 sends such
an order to P2 after taking its own checkpoint. P2 sends a take_checkpoint order
back to P0. When P0 receives this take_checkpoint order, it knows the first phase
has completed: each of the processes has taken a checkpoint and the second phase
of the algorithm can begin. P0 sends a message_log order on each of its outgoing
channels, to P1 and P2, asking them to log onto stable storage the (application)
messages they received since they recorded the checkpoint. P1 does so; P2 has
no such message to log. In each case, they send out similar message_log orders.
When, for example, P0 receives such an order from P1, it checks if it has received
any messages between the last time it logged messages and when it received this
order, and discovers that it has nothing to log. A little time later, it receives such
an order from P2: it responds to this by logging m5.

FIGURE 6.12 Example for staggered checkpointing.

Each time such a message is received, the process logs the messages; if it is the 
first time such a message_log order is received by it, the process sends out marker
messages on each of its outgoing channels. 

We are proceeding on the assumption that given the checkpoint and the
messages received, a process can be recovered. Hence, each process can be recovered
up to the point when it receives a message that is not logged (this is the logical 
checkpoint position indicated in Figure 6.12b). 

Note that in this algorithm, we may have orphan messages with respect to the 
physical checkpoints that are taken in the first phase. However, orphan messages 
will not exist with respect to the latest (in time) logical checkpoints that can be 
generated using the physical checkpoint and the message log. 


6.6 Checkpointing in Shared-Memory Systems

We now describe a variant of the CARER scheme for shared-memory, bus-based
multiprocessors, in which each processor has its own private cache. This scheme
involves changing the algorithm used to maintain cache coherence among the
multiple caches in a multiprocessor. In this variant, in place of the single bit
that marked a line as unmodifiable, we have a multi-bit identifier: we associate a
checkpoint identifier C_id with each cache line. A checkpoint counter, C_count, keeps
track of the current checkpoint number. To take a checkpoint, we increment this
counter. Thus, any line that was modified before this instant will have a C_id field
which is smaller than the value of the counter. Whenever a line is updated, we set
C_id = C_count. If a line has been modified since being brought into the cache and
C_id < C_count, this line is part of the checkpoint state, and is therefore unmodifiable.
Any writes into such a line must wait until the line is first written into the main
memory.
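The identifier-and-counter rule can be sketched for a single cache as follows. This is a toy model under stated assumptions: the class names are hypothetical, the counter is unbounded (rollover is not modeled), each line holds one value, and coherence traffic between caches is ignored.

```python
class Line:
    """One cache line with its checkpoint identifier."""

    def __init__(self, value):
        self.value = value
        self.modified = False
        self.c_id = 0


class CheckpointCache:
    """Single-cache model of the checkpoint identifier scheme."""

    def __init__(self):
        self.c_count = 0     # current checkpoint number
        self.lines = {}      # address -> Line
        self.memory = {}     # main memory

    def take_checkpoint(self):
        # Taking a checkpoint is just an increment of the counter.
        self.c_count += 1

    def is_unmodifiable(self, address):
        line = self.lines[address]
        return line.modified and line.c_id < self.c_count

    def write(self, address, value):
        if address not in self.lines:
            self.lines[address] = Line(self.memory.get(address))
        line = self.lines[address]
        if line.modified and line.c_id < self.c_count:
            # Part of the checkpoint: write it to main memory first.
            self.memory[address] = line.value
        line.value = value
        line.modified = True
        line.c_id = self.c_count
```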

If the counter has k bits, it rolls over to 0 after reaching 2^k − 1. When it reaches
2^k − 1 and a checkpoint is to be taken, each modified line has its C_id set to 0.

6.6.1 Bus-Based Coherence Protocol 

Let us first consider a cache coherence algorithm without checkpointing. We will 
then see how it can be modified to take account of checkpointing. 

The algorithm is for bus-based multiprocessors: all the traffic between caches 
and memory must travel on this bus. This means that all the caches can watch the 
traffic on the bus. 

A cache line can be in one of the following states: invalid, shared unmodified,
exclusive modified, and exclusive unmodified. Exclusive means that this is the only
valid copy in any of the caches; modified means that the line has been modified
since it was brought into the cache from the main memory. Figure 6.13 shows the
state diagram associated with this algorithm. If the line is in shared unmodified state
and the processor wishes to update it, it moves into the exclusive modified state.
(All other caches holding the same line must invalidate their copies, since these
are no longer current.) If, while a line is in the exclusive modified or exclusive
unmodified state, another cache puts out a read request on the bus, this cache must
service that request (since it holds the only current copy of that line). As a
by-product of this action, the memory is also updated if necessary. After doing so,
the state moves from exclusive modified to shared unmodified. A write miss is handled
by considering it to be a read miss followed by a write hit. Hence, when there is a
write miss, the line is brought into the cache and its state becomes exclusive
modified, because it is modified upon the write and this cache holds the only current
copy of that line. The other transitions are reasoned similarly.

FIGURE 6.13 Original bus-based cache coherence algorithm.

FIGURE 6.14 Bus-based cache coherence and checkpointing algorithm.

How can we modify this protocol to account for checkpointing? The original 
exclusive modified state now splits into two: exclusive modified and unmodifiable. The 
state diagram for this algorithm is shown in Figure 6.14. When a line becomes part 
of the checkpoint, it is marked unmodifiable to keep it stable. Before this line can be 
changed, it must first be copied to memory so that it will be retained for use in the 
event of a rollback. 

6.6.2 Directory-Based Protocol 

In this approach to cache coherence, a directory is maintained centrally, which 
records the status of each line. We can regard this directory as being controlled by 
some shared-memory controller. This controller handles all read and write misses 
and all other operations that change line state. For example, if a line is in the
exclusive unmodified state and the cache holding that line wants to modify it, it notifies
the controller of its intention. The controller can then change the state to exclusive 
modified. It is then a simple matter to implement this checkpointing scheme atop 
such a protocol.

6.7 Checkpointing in Real-Time Systems

A real-time system is characterized by the need to meet deadlines. In hard real-time
systems, missing a deadline can be very costly; process control is one such
example. In soft real-time systems, on the other hand, missed deadlines may lower the
quality of service provided but are not catastrophic. Most multimedia systems are
soft real-time systems. However, it is ultimately the application that determines
whether the system is hard or soft. A multimedia system that is used for the
remote control of a vehicle is a hard real-time system; the more common case in
which it is used to watch movies over the Internet is soft real-time.

The performance of a real-time system is related to the probability that the system
will meet all its critical deadlines. Therefore, the goal of checkpointing in a
real-time system is to maximize this probability and not to minimize the mean 
execution time. Indeed, checkpointing in a real-time system may well increase the 
average execution time: this is a price worth paying if the probability of missing a 
deadline decreases sufficiently. 

We present next an analytical model similar to the one presented in Section 6.3,
but one that calculates the density function of the execution time of a task instead
of the average execution time. We place a checkpoint after every $T_{ex}$ units of useful
work; each checkpoint takes $T_{ov}$ units in overhead. We are assuming here that
checkpoint latency and overhead are identical: the system is so simple that the
CPU has no other unit to which to delegate the checkpointing task. Transient faults
occur at a constant rate $\lambda$. When a transient failure hits the processor, it goes down
for time $T_r$ (including rebooting if necessary).

Let $f_{int}(t)$ be the probability density function of the time taken between successive
initiations of checkpoints. There are two cases. In Case 1, there is no failure
over the interval $T_{ex} + T_{ov}$; in Case 2, there is at least one failure.

If Case 1 occurs (which it does with probability $e^{-\lambda(T_{ex}+T_{ov})}$), the interval between
checkpoint initiations will be $T_{ex} + T_{ov}$. In Case 2, the time will be greater
than $T_{ex} + T_{ov}$. To analyze Case 2, let us condition on the epoch of the first failure.
Suppose the first failure hits $\tau$ time units into the interval. Then, we lose all $\tau$ time
units of computation. Further, we take $T_r$ time units to recover. Hence, $\tau + T_r$ time
units later, the processor is ready to restart execution of this interval. Following
such a restart, the density function of the rest of the execution of this interval will
be identical to the unconditional density function. Therefore, the conditional density
function of the execution time, conditioned on the first failure happening $\tau$
time units into the interval, is $f_{int}(t - [\tau + T_r])$. The probability of the first failure
happening in the interval $[\tau, \tau + d\tau]$ is $\lambda e^{-\lambda\tau}\,d\tau$. Thus,

$$
f_{int}(t) = \int_{\tau=0}^{T_{ex}+T_{ov}} \lambda e^{-\lambda\tau} f_{int}(t - [\tau + T_r])\,d\tau
\quad \text{if } t > T_{ex} + T_{ov} + T_r
\tag{6.12}
$$
Clearly, the execution time can never be less than $T_{ex} + T_{ov}$, nor can it fall in the
interval $(T_{ex} + T_{ov}, T_{ex} + T_{ov} + T_r)$ because a failure takes time $T_r$ to recover from.
Further, it will be exactly equal to $T_{ex} + T_{ov}$ in the (common) case that there is
no failure. This is represented by a Dirac delta function at that point of magnitude
$e^{-\lambda(T_{ex}+T_{ov})}$. (For those unfamiliar with the term, a Dirac delta function, $\delta(t)$, has the
property that for any density function $f(t)$ and some constant $a$,
$\int_{-\infty}^{\infty} f(t)\,\delta(t - a)\,dt = f(a)$. It is an impulse function.)

To summarize, we can now write the density function as 


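$$
f_{int}(t) =
\begin{cases}
e^{-\lambda(T_{ex}+T_{ov})}\,\delta(t - [T_{ex}+T_{ov}]) & \text{if } t = T_{ex}+T_{ov} \\
0 & \text{if } t \neq T_{ex}+T_{ov} \text{ and } t \leq T_{ex}+T_{ov}+T_r \\
\int_{\tau=0}^{T_{ex}+T_{ov}} \lambda e^{-\lambda\tau} f_{int}(t - [\tau+T_r])\,d\tau & \text{if } t > T_{ex}+T_{ov}+T_r
\end{cases}
\tag{6.13}
$$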


Such an equation can be solved numerically. 
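
One possible numerical approach, sketched below under stated assumptions, is to replace the density by probability masses on a time grid of step `h`. The function name `interval_pmf`, the grid step, and the truncation point `t_max` are choices made for this illustration and are not part of the model. The first failure is rounded down to the start of its grid bin, so the result is only an approximation of Equation 6.13.

```python
import math


def interval_pmf(t_ex, t_ov, t_r, lam, h, t_max):
    """Approximate the inter-checkpoint time distribution on a grid of step h.

    p[k] approximates the probability that the interval takes about k*h time units.
    """
    a = round((t_ex + t_ov) / h)  # fault-free interval length, in bins
    r = round(t_r / h)            # recovery time, in bins
    if a < 1 or r < 1:
        raise ValueError("h must be small enough that both lengths span a bin")
    n = int(t_max / h) + 1
    # q[j]: probability that the first failure lands in bin j of the interval
    q = [math.exp(-lam * j * h) - math.exp(-lam * (j + 1) * h) for j in range(a)]
    p = [0.0] * n
    for k in range(a, n):
        if k == a:
            # Case 1: no failure, the impulse at T_ex + T_ov
            p[k] = math.exp(-lam * a * h)
        # Case 2: first failure in bin j, then recovery, then a fresh restart
        for j in range(a):
            m = k - j - r
            if m < a:  # no probability mass below the fault-free time
                break
            p[k] += q[j] * p[m]
    return p


# Example: T_ex = 0.0375, T_ov = 0.015, T_r = 0.1, failure rate 1e-3
pmf = interval_pmf(0.0375, 0.015, 0.1, 1e-3, h=0.0005, t_max=1.0)
```

A quick sanity check is that the masses in `pmf` sum to very nearly 1 when `t_max` is large compared with the recovery time.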

If we take $N$ checkpoints, the density function of the overall execution time is
the $(N + 1)$-fold convolution of the density function per intercheckpoint interval:
$f_{exec}(t) = f_{int}^{*(N+1)}(t)$. The average time taken is calculated as shown in Section 6.3.1.
If the real-time deadline is $t_d$, the probability of missing it is given by

$$
\text{Prob}\{\text{deadline missed}\} = \int_{t_d}^{\infty} f_{exec}(t)\,dt
$$



To demonstrate the tradeoff, let us consider a specific numerical example. Let
$T = 0.15$ time units and $\lambda = 10^{-3}$ per time unit. The recovery time is $T_r = 0.1$
unit. In Figure 6.15, the probability of missing a deadline is plotted for two cases:
$T_{ov} = 0.015$ and $T_{ov} = 0.025$. Table 6-1 shows the average execution time as a function
of the number of checkpoints. For the parameters used, the expected execution
time actually worsens as we increase the number of checkpoints: this is to be
expected because the probability of failure during execution is less than 1%. However,
when we focus on the probability of missing a deadline, the situation is more
complicated (see Figure 6.15). For tight deadlines, when there is little available
slack, increasing the number of checkpoints can make things worse. When deadlines
are further into the future, thereby making more slack available, a greater
number of checkpoints improves matters. For example, for a deadline of 0.5 and
$T_{ov} = 0.015$, using six checkpoints is significantly better than using three. By contrast,
for a deadline of 0.3, having three checkpoints is better than six. In every
case, the deadline-missing probabilities are small; however, there are real-time
applications where such probabilities have to be very low indeed.
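
Readers who wish to experiment with such tradeoffs can estimate the deadline-missing probability with a short program. The sketch below is one possible approach and not the method used to produce Figure 6.15: it discretizes Equation 6.13 on a grid of step `h`, convolves the per-interval distribution $N + 1$ times, and sums the mass beyond the deadline. The function names, the grid step, and the rounding of interval lengths to whole bins are assumptions of this illustration, so its numbers need not match the figure.

```python
import math


def interval_pmf(a, r, lam, h, n):
    """Grid version of Equation 6.13; a and r are T_ex+T_ov and T_r in bins."""
    q = [math.exp(-lam * j * h) - math.exp(-lam * (j + 1) * h) for j in range(a)]
    p = [0.0] * n
    for k in range(a, n):
        if k == a:
            p[k] = math.exp(-lam * a * h)       # no failure in the interval
        # first failure in bin j, recovery, then restart of the interval
        for j in range(min(a, k - r - a + 1)):
            p[k] += q[j] * p[k - j - r]
    return p


def convolve(x, y, n):
    """Convolution of two mass vectors, truncated to the first n bins."""
    out = [0.0] * n
    for i, xi in enumerate(x):
        if xi == 0.0:
            continue
        for j in range(n - i):
            out[i + j] += xi * y[j]
    return out


def miss_probability(t, n_chk, t_ov, t_r, lam, deadline, h=0.0005):
    """Probability that a task with n_chk checkpoints finishes after the deadline."""
    a = round((t / (n_chk + 1) + t_ov) / h)
    r = round(t_r / h)
    if a < 1 or r < 1:
        raise ValueError("h is too coarse for these parameters")
    n = round(deadline / h) + 1                 # bins 0 .. deadline
    p = interval_pmf(a, r, lam, h, n)
    total = p
    for _ in range(n_chk):                      # (N+1)-fold convolution
        total = convolve(total, p, n)
    return max(0.0, 1.0 - math.fsum(total))     # mass beyond the deadline


for n_chk in (3, 6):
    print(n_chk, miss_probability(0.15, n_chk, 0.015, 0.1, 1e-3, deadline=0.3))
```

Truncating every vector at the deadline keeps the convolutions small, since only the mass at or before the deadline is needed.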

The reader should compare the results for $T_{ov} = 0.025$ with those for $T_{ov} = 0.015$
and obtain an intuitive explanation for the differences seen.

6.8 Other Uses of Checkpointing

Fault tolerance is but one application of checkpoints. Here, briefly, are two others. 

■ Process Migration. Since a checkpoint represents a process state, migrating
a process from one processor to another simply involves moving the
checkpoint, after which computation can resume on the new processor. The 
nature of the checkpoint determines whether the new processor must be of 
the same kind and run the same operating system as the old one. 

Process migration can be used to recover from permanent or intermittent 
faults. Another use is in load balancing, to achieve overall better utilization 
of a distributed system by ensuring that the computational load is appropriately
shared among the processors.

■ Debugging. Checkpointing can be used to provide the programmer with 
snapshots of the program state at discrete epochs. Such snapshots can be 
extremely useful to study the change of variable values over time and to get 
a deeper understanding of program behavior. 
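
Both uses rest on the same idea: the checkpoint is a self-contained copy of the process state. The toy example below illustrates this with Python's standard `pickle` module; the dictionary used as "process state" and the function names are hypothetical, and a real checkpoint would also have to capture items such as open files and registers.

```python
import pickle


def run(state, steps):
    """Toy computation: 'state' holds everything the process needs to continue."""
    for _ in range(steps):
        state["i"] += 1
        state["total"] += state["i"]
    return state


def take_checkpoint(state):
    # Serialize the state; the bytes could be stored or shipped to another node.
    return pickle.dumps(state)


def resume(blob, steps):
    # On the destination processor: rebuild the state and keep computing.
    return run(pickle.loads(blob), steps)


state = run({"i": 0, "total": 0}, 5)
blob = take_checkpoint(state)          # snapshot, also usable for debugging
migrated = resume(blob, 5)             # continues where the original stopped
```

Here `migrated` ends in the same state as an uninterrupted run of ten steps, and the intermediate `blob` can be inspected as a snapshot of the program at step five.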


6.9 Further Reading 


A good discussion of the various levels at which checkpointing can be done appears
in [21]. The distinction between checkpointing latency and overhead, and
the greater impact of overhead, was pointed out in [31]. Copy-on-write for faster
checkpointing is discussed in [17] and memory exclusion in [23]. A study of the
feasibility of incremental checkpointing for scientific applications can be found
in [25].

Checkpoint placement for general-purpose systems has a large literature associated
with it: some examples are [6,9,15,26,36,37]. An early performance model
for checkpointing is presented in [28]. CARER is described in [2,11]. A more recent
work on using caches in checkpointing can be found in [29].

There is an excellent survey of distributed checkpointing issues with a comprehensive
bibliography in [8]. A slightly more theoretical treatment can be found in
[4]. Two widely cited early works in checkpointing in distributed systems are the
algorithms which appeared in [5] and in [14] (described in Section 6.5.2). The staggered
checkpointing algorithm is presented in [32]. A good reference for the use
of synchronized clocks to avoid explicit coordination during checkpointing is [19].
Diskless checkpointing using approaches similar to that in RAID is discussed in
detail in [20,22]. Two-level recovery is considered in [30]. This paper contains a
detailed performance model of a two-level recovery scheme.

There is a substantial literature on message logging, including optimistic and 
pessimistic algorithms [3,8], sender-based message logging [12], optimistic recovery
schemes [13,27,33] and the drawbacks of optimistic algorithms [10].

When discussing message logging, we assumed that process recovery would 
follow if we rolled back the affected process to a checkpoint and then replayed the 
messages that it received beyond that point. This is not always true: it is possible 
for the process to take a different execution path if something in the operating 
environment is different (e.g., the amount of available swap space in the processor 
is different). For a discussion on this, see [7]. 

The bus-based coherence protocol is covered in [35]. 

Checkpointing in real-time systems is discussed in [16,26]. Checkpointing for
mobile computers is a topic of growing interest, given the proliferation of mobile
applications. For some algorithms, see [1,18,24]. Other applications of checkpointing
(besides fault tolerance) are discussed in [34].


6.10 Exercises 

1. In Section 6.3.1, we derived an approximation for the expected time between 
checkpoints as a function of the checkpoint parameters. 

a. Calculate the optimum number of checkpoints, and plot the approximate
total expected execution time as a function of $T_{ov}$. Assume that $T = 1$, $T_{lt} =
T_{ov}$, and $\lambda = 10^{-5}$. Vary $T_{ov}$ from 0.01 to 0.2.

b. Plot the approximate total expected execution time as a function of $\lambda$. Fix
$T = 1$, $T_{ov} = 0.1$, and vary $\lambda$ from $10^{-7}$ to $10^{-1}$.

2. In Section 6.3.1, we derived an expression for $N_{opt}$, the optimal number of
checkpoints. We noted that this term includes $T_r$, the recovery time per failure.
In particular, $N_{opt}$ tends to decrease as $T_r$ increases.

Explain why the assumption that there can be no more than one failure in 
any intercheckpoint interval contributes to the presence of T r in this expression. 

3. You have a task with execution time $T$. You take $N$ checkpoints, equally spaced
through the lifetime of that task. The overhead for each checkpoint is $T_{ov}$ and
$T_{lt} = T_{ov}$. Given that during execution, the task is affected by a total of $k$ point
failures (i.e., failures from which the processor recovers in negligible time), answer
the following questions.

a. What is the maximum execution time of the task?

b. Find $N$ such that this maximum execution time is minimized. It is fine to
get a non-integer answer (say $x$): in practice, this will mean that you will
pick the better of $\lfloor x \rfloor$ and $\lceil x \rceil$.

4. Solve Equation 6.10 numerically, and compare the calculated result to the value
obtained in Equation 6.8 for the simpler model. Assume $T_r = 0$ and $T_{lt} = T_{ov} =
0.1$. Vary $\lambda$ from $10^{-7}$ to $10^{-2}$. $T = 1$.

5. In this problem, we will look at checkpointing for real-time systems. You have
a task with an execution time of $T$ and a deadline of $D$. $N$ checkpoints are
placed equidistantly through the lifetime of the task. The overhead for each
checkpoint is $T_{ov}$. Point transient failures occur at a constant rate $\lambda$.

a. Derive a first-order model for the probability of missing a deadline, by
conditioning on the number of failures over $[0, T + NT_{ov}]$. Start by calculating
the probability of missing a deadline if there is exactly one failure
over $[0, T + NT_{ov}]$. Then, find lower and upper bounds for the probability
of missing a deadline if there is more than one failure over $[0, T + NT_{ov}]$.
Use the total probability formula to derive expressions for lower and upper
bounds of this probability.

b. Plot the upper bound of the deadline-missing probability as a function of
$N$, where $N$ varies from 0 to $\min(20, \lfloor (D - T)/T_{ov} \rfloor)$.

b1. Set $\lambda = 10^{-5}$, $D = 1.0$, $T_{ov} = 0.05$, and plot curves for the following values
of $T$: 0.5, 0.6, 0.7.

b2. Set $\lambda = 10^{-5}$, $D = 1.0$, $T = 0.6$, and plot curves for the following values
of $T_{ov}$: 0.01, 0.05, 0.09.

b3. Set $D = 1.0$, $T = 0.6$, $T_{ov} = 0.05$, and plot curves for the following values
of $\lambda$: $10^{-3}$, $10^{-5}$, $10^{-7}$.

6. In this problem, we will study what happens if the checkpoint overheads are
not constant over time but vary. That is, there are times when the size of the
process state is small and others when it is substantial. Suppose you are
given this information, namely, you have a function, $T_{ov}(t)$, which is the checkpointing
overhead $t$ seconds into the task execution.

a. Devise an algorithm to place checkpoints in such a way that the expected
overall overhead is approximately minimized. (You may want to consult
reference works on optimization for this.) You can assume that if the execution
time is $T$ and failure occurs at constant rate $\lambda$, then $\lambda T \ll 1$.

b. Let $T_{ov}(t) = 10 + \sin(t)$. For $T = 1000$ and failure rate $\lambda = 10^{-5}$, run your
algorithm to place the checkpoints appropriately.

7. Identify all the consistent recovery lines in the following execution of two concurrent
processes:



8. Suppose you are designing a checkpointing scheme for a distributed system
specified to be single-fault tolerant. That is, the system need only guarantee
successful recovery from any one failure: a second failure before the system
has recovered from the first one is assumed to be of negligible probability. You
decide to take checkpoints and carry out message-logging. Show that it is sufficient
for each processor to simply record the messages it sends out in its volatile
memory. (By volatile memory, we mean memory that will lose its contents in
the event of a failure.)

9. We have seen that checkpointing distributed systems is quite complex and that 
uncoordinated checkpointing can give rise to a domino effect. In this problem, 
we will run a simulation to get a sense of how likely it is that a domino effect 
will happen. 

You have $N$ processors, each of which has its own clock. A processor checkpoints
when its clock reads $nT$ for $n = 1, 2, \ldots$.

If $t$ is the time told by a perfect clock, the time told by any of these clocks is
given by $t + e$, where $e$ is uniformly distributed over the range $[-\Delta, \Delta]$. The
clocks are therefore synchronized with a maximum skew between any two
clocks of $2\Delta$.

The messages sent out by the processors can be modeled as follows. Each
processor generates messages according to a Poisson process with rate $\mu$; any
message can be to any of the $N - 1$ other processors with equal probability.

Failures strike processors according to a Poisson process with rate $\lambda$, and
processors fail independently of one another.

Write a simulation to evaluate the probability that the domino effect happens
in this system. (If you are not familiar with how to write such simulations,
look in Chapter 10.) Study the impact of varying $N$, $\Delta$, $\lambda$, and $\mu$. Comment on
your results.

Case Studies


The purpose of this chapter is to illustrate the practical use of methods described
previously in the book, by highlighting the fault-tolerance aspects of six
different computer systems that have various fault-tolerance techniques implemented
in their design. We do not aim at providing a comprehensive, low-level
description; for that, the interested reader should consult the references mentioned
in the Further Reading section.

7.1 NonStop Systems 

Several generations of NonStop systems have been developed since 1976, by Tandem
Computers (since acquired by Hewlett Packard). The main use for these fault-tolerant
systems has been in online transaction processing, where a reliable response
to inquiries in real time must be guaranteed. The fault-tolerance features
implemented in these systems have evolved through several generations, taking
advantage of better technologies and newer approaches to fault tolerance. In this
section we present the main (although not all) fault-tolerance aspects of the NonStop
designs.

7.1.1 Architecture 

The NonStop systems have followed four key design principles, listed below. 

■ Modularity. The hardware and software are constructed of modules of fine
granularity. These modules constitute units of failure, diagnosis, service,
and repair. Keeping the modules as decoupled as possible reduces the probability
that a fault in one module will affect the operation of another.

■ Fail-Fast Operation. A fail-fast module either works properly or stops. Thus,
each module is self-checking and stops upon detecting a failure. Hardware
checks (through error-detecting codes; see Chapter 3) and software consistency
tests (see Chapter 5) support fail-fast operation.

■ Single Failure Tolerance. When a single module (hardware or software) fails,
another module immediately takes over. For processors, this means that a
second processor is available. For storage modules, it means that the module
and the path to it are duplicated.

■ Online Maintenance. Hardware and software modules can be diagnosed, disconnected
for repair and then reconnected, without disrupting the entire
system's operation.

We next discuss briefly the original architecture of the NonStop systems, focusing 
on the fault-tolerance features. In the next two sections, the maintenance aids and 
software support for fault tolerance are presented. Finally, we describe the modifications
which have been made to the original architecture.

Although there have been several generations of NonStop systems, many of
the underlying principles remain the same and are illustrated in Figure 7.1. The
system consists of clusters of computers, in which a cluster may include up to 16
processors. Each custom-designed processor has a CPU, a local memory containing
its own copy of the operating system, a bus control unit, and an I/O channel.
The CPU differs from standard designs in its extensive error detection capabilities
to support the fail-fast mode of operation. Error detection on the datapath is
accomplished through parity checking and prediction, whereas the control part
is checked using parity, detection of illegal states, and specially designed self-checking
logic (the description of which is beyond the scope of this book, but a
pointer to the literature is provided in the Further Reading section). In addition,
the design includes several serial-scan shift registers, allowing fast testing to isolate
faults in field-replaceable units.

The memory is protected with a Hamming code capable of single-error correction
and double-error detection (see Section 3.1). The address is protected with a
single-error-detection parity code.

The cache has been designed to perform retries to take care of transient faults. 
There is also a spare memory module that can be switched in if permanent failures 
occur. The cache supports a write-through policy, guaranteeing the existence of a 
valid copy of the data in the main memory. A parity error in the cache will force a 
cache miss followed by refetching of the data from the main memory. 

Parity checking is not limited to memory units but is also used internally in the
processor. All units that do not modify the data, such as buses and registers, propagate
the parity bits. Other units that alter the data, such as arithmetic units and
counters, require special circuits that predict the parity bits based on the data and
parity inputs. The predicted parity bits can then be compared to the parity bits
generated out of the produced outputs, and any mismatch between the two will
raise a parity error indication. This technique is discussed in Chapter 9 and is very
suitable to adders. Extending it to multipliers would result in a very complicated
circuit, and consequently, a different technique to detect faults in the multiplier
has been followed. After each multiply operation, a second multiplication is performed
with the two operands exchanged and one of them shifted prior to the
operation. Since the correlation between the results of the two multiplications is
trivial, a simple circuit can detect faults in the multiply operation. Note that even
a permanent fault will be detected because the same multiplication is not repeated.
This error detection scheme is similar to the recomputation with shifted operands
technique for detecting faults in arithmetic operations (see Section 5.2.4).
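
The multiplier check can be illustrated with a small software model. This is only a sketch under stated assumptions: the functions below, including the stuck-bit fault model, are hypothetical stand-ins for hardware, and a one-bit left shift is assumed for the shifted operand.

```python
def checked_multiply(a, b, multiplier):
    """Multiply twice; the second time with operands exchanged and one shifted.

    'multiplier' stands in for the hardware unit under test.
    """
    first = multiplier(a, b)
    second = multiplier(b << 1, a)   # exchanged, one operand shifted left by one
    if second != first << 1:         # a fault-free unit gives exactly twice the product
        raise ArithmeticError("multiplier fault detected")
    return first


def good_multiplier(x, y):
    return x * y


def stuck_bit_multiplier(x, y):
    # Model of a permanent fault: bit 3 of every product is stuck at 1.
    return (x * y) | 0b1000


print(checked_multiply(6, 7, good_multiplier))
```

With the faulty unit, `checked_multiply(2, 2, stuck_bit_multiplier)` raises an error because the two products no longer differ by a factor of two. A fault that happens not to corrupt either product goes unnoticed, but in that case the returned result is still correct.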

Note the absence of a shared memory in Figure 7.1. A shared memory can simplify
the communication among processors but may become a single point of failure.
The 16 (or fewer) processors operate independently and asynchronously and
communicate with each other through messages sent over the dual Dynabuses.
The Dynabus interface is designed such that a single processor failure will not disable
both buses. Similar duplication is also followed in the I/O systems, in which a
group of disks is controlled by dual-ported controllers which are connected to I/O
buses from two different processors. One of the two ports is designated as the primary.
If the processor (or its associated I/O bus) that is connected to the primary
port fails, the controller switches to the secondary/backup port. With dual-ported
controllers and dual-ported I/O devices, four separate paths run to each device.
All data transfers are parity-checked, and a watchdog timer detects if a controller
stops responding or if a nonexistent controller was addressed.

The above design allows the system to continue its operation despite the failure
of any single module. To further support this goal, the power, cabling and
packaging were also carefully designed. Parts of the system are redundantly powered
from two different power supplies, allowing them to tolerate a power supply
failure. In addition, battery backups are provided so that the system state can be
preserved in case of a power failure.

The controllers have a fail-fast requirement similar to the processors. This is
achieved through the use of dual lock-stepped microprocessors (executing the
same instructions in a fully synchronized manner) with comparison circuits to detect
errors in their operation, and self-checking logic to detect errors in the remaining
circuitry within the controller. The two independent ports within the controller
are implemented using physically separated circuits to prevent a fault in one from
affecting the other.

The system supports disk mirroring (see Section 3.2), which, when used, provides
eight paths for data read and write operations. Disk mirroring is further discussed
in Section 7.1.3. The disk data is protected by end-to-end checksums (see
Section 3.1). For each data block, the processor calculates a checksum and appends 
it to the data written to the disk. This checksum is verified by the processor when 
the data block is read from the disk. The checksum is used for error detection, 
whereas the disk mirroring is used for data recovery. 
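
The division of labor between checksum and mirror can be sketched as follows. The text does not specify the checksum algorithm, so the example assumes a 32-bit CRC from Python's standard `zlib` module purely for illustration; the function names are likewise hypothetical.

```python
import zlib


def write_block(data: bytes) -> bytes:
    """Append a 4-byte checksum to the data before it is written to disk."""
    return data + zlib.crc32(data).to_bytes(4, "big")


def verify_block(stored: bytes):
    """Return the data if its checksum matches, otherwise None."""
    data, tag = stored[:-4], stored[-4:]
    return data if zlib.crc32(data).to_bytes(4, "big") == tag else None


def read_mirrored(primary: bytes, mirror: bytes) -> bytes:
    """Detect errors with the checksum; recover from the mirrored copy."""
    for copy in (primary, mirror):
        data = verify_block(copy)
        if data is not None:
            return data
    raise IOError("both copies failed the checksum")


good = write_block(b"account=42;balance=100")
bad = bytes([good[0] ^ 0x01]) + good[1:]     # one corrupted bit on the first disk
recovered = read_mirrored(bad, good)
```

The corrupted copy is rejected by its checksum, and the read is satisfied from the mirror.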



FIGURE 7.1 Original NonStop system architecture.

7.1.2 Maintenance and Repair Aids

Special effort has been made to automatically detect errors, analyze them, and report
the analysis to remote support centers, and then track related repair actions.
The system includes a maintenance and diagnostic processor which communicates
with all the processors in the system and with a remote service center. This maintenance
processor collects failure-related information and allows engineers at the
remote center to run diagnostic tests. It is also capable of reconfiguring the system
in response to detected faults.

Internally, each computing processor module has a diagnostic unit which monitors
the status of the computing processor and the associated logic, including
the memory, the Dynabus interface, and the I/O channel. It reports to the central
maintenance processor any errors that are detected. In addition, the diagnostic
unit, upon a request received from the remote service center (through the central 
maintenance processor), can force the computing processor to run in a single-step 
mode and collect diagnostic information obtained through the scan paths. It can 
also generate pseudo-random tests and run them on the different components of 
the computing processor module. 

The central maintenance processor is capable of some automatic fault diagnosis 
through the use of a knowledge database that includes a large number of known 
error values. It also controls and monitors a large number of sensors for power 
supply voltages, intake and outlet air temperatures, and fan rotation. 


7.1.3 Software 

As should be clear by now, the amount of hardware redundancy in the original
NonStop system was quite limited, and massive redundancy schemes, such as
triple modular redundancy, were avoided. Almost all redundant hardware modules
that do exist (such as redundant communication buses) contribute to the
performance of the fault-free system. Most of the burden of the system fault
tolerance is borne by the operating system (OS) software. The OS detects failures
of processors or I/O channels and performs the necessary recovery. It manages
the process pairs that constitute the primary fault-tolerance scheme used in NonStop.
A process pair includes a primary process and a passive backup process that
is ready to become active when the primary process fails. When a new process
starts, the OS generates a clone of this process on another processor. This backup
process goes immediately into a passive mode and waits for messages from either
its corresponding primary or the OS. At certain points during the execution
of the primary process, checkpoints are taken (see Chapter 6), and a checkpointing
message containing the process state is sent by the primary to the backup. The
process state of the backup is updated by the OS, whereas the backup process itself
remains passive. If the primary process fails, the OS orders the backup to start
execution from the last checkpoint.
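
The process-pair idea can be illustrated with a toy model. The class and function names, the dictionary used as process state, and the fixed checkpoint period are all assumptions of this sketch; it does not reproduce the actual NonStop OS interfaces.

```python
import copy


class BackupProcess:
    """Passive clone: it only stores the latest checkpoint until activated."""

    def __init__(self):
        self.state = None

    def receive_checkpoint(self, state):
        self.state = copy.deepcopy(state)

    def take_over(self, work):
        # Start from the last checkpoint and redo the work done after it.
        state = copy.deepcopy(self.state)
        for item in work[state["done"]:]:
            state["total"] += item
            state["done"] += 1
        return state


def run_primary(work, backup, checkpoint_every, fail_after=None):
    """Process the work items, sending a checkpoint message periodically."""
    state = {"done": 0, "total": 0}
    backup.receive_checkpoint(state)
    for item in work:
        if fail_after is not None and state["done"] == fail_after:
            return None                      # the primary fails; its state is lost
        state["total"] += item
        state["done"] += 1
        if state["done"] % checkpoint_every == 0:
            backup.receive_checkpoint(state)
    return state


work = [1, 2, 3, 4, 5, 6, 7]
backup = BackupProcess()
result = run_primary(work, backup, checkpoint_every=3, fail_after=5)
if result is None:
    result = backup.take_over(work)          # the OS activates the backup
```

In this run the primary fails after five items, the backup holds the checkpoint taken after three, and it redoes items four and five before finishing the rest.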

Processors continuously check on each other's health through sending "I am
alive" messages once every second to all other processors (over the two interprocessor
buses) and to themselves (to verify that the bus send and receive circuits
are working). Every two seconds, each processor checks whether it has received
at least one "I am alive" message from every other processor. If such a message is
missing, the corresponding processor is declared faulty and all outstanding communications
with it are canceled. All processors operate as independent entities,
and no master processor exists that could become a single point of failure.
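
The periodic check each processor performs can be sketched as a small bookkeeping class. The name `Monitor` and its methods are hypothetical; timers and the actual bus messages are left out, and `check()` is assumed to be called once every two seconds.

```python
class Monitor:
    """Per-processor record of which peers have been heard from recently."""

    def __init__(self, self_id, processors):
        self.self_id = self_id
        self.alive = set(processors)
        self.heard_from = set()

    def on_i_am_alive(self, sender):
        # Called whenever an "I am alive" message arrives.
        self.heard_from.add(sender)

    def check(self):
        """Declare faulty every peer that sent nothing since the last check."""
        faulty = [p for p in sorted(self.alive)
                  if p != self.self_id and p not in self.heard_from]
        self.alive -= set(faulty)       # cancel communication with these peers
        self.heard_from.clear()
        return faulty


monitor = Monitor(0, range(4))
for sender in (1, 3, 1):                # processor 2 stays silent
    monitor.on_i_am_alive(sender)
print(monitor.check())                  # prints [2]
```

Because every processor runs its own monitor, no single master is needed to decide which processors are down.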

An important component of the OS is the disk access process, which provides
reliable access to the data on the disks despite any failure in a processor, channel,
controller, or the disk module itself. This process is also implemented as a
(primary/backup) process pair, and it manages a pair of mirrored disks that are
connected through two controllers and two I/O channels providing eight possible
paths to the data. As was indicated in Section 3.2, mirrored disks provide better
performance through shorter read times (by preferring the disk with the shorter
seek time) and support of multiple read operations. Disk write operations are more
expensive, but not necessarily much slower, since the two writes are done in parallel.

Because transaction processing has been the main market for the NonStop systems,
special care has been taken to ensure reliable transactions. A Transaction
Monitoring Module (of the OS) controls all the steps from the beginning of the
transaction to its completion, going through multiple database accesses and multiple
file updates on several disks. This module guarantees that each transaction
will have the standard so-called ACID properties required of databases:

■ Atomic. Either all, or none, of the database updates are executed. 

■ Consistent. Every successful transaction preserves the consistency of the 
database. 

■ Isolated. All events within a transaction are isolated from other transactions 
which may execute concurrently to allow any failing transaction to be reset. 

■ Durable. Once a transaction commits, its results survive any failure. 

Any failure during the execution of a transaction will result in an abort-transaction 
step, which will undo all database updates. 
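
One common way to make an abort undo all updates is an undo log, sketched below. This is an illustration of the general technique and is not claimed to be how the Transaction Monitoring Module is implemented; the `Transaction` class and the dictionary used as a database are hypothetical.

```python
_MISSING = object()   # marks keys that did not exist before the transaction


class Transaction:
    """Records the old value of every key it changes so it can be rolled back."""

    def __init__(self, db):
        self.db = db
        self.undo_log = []

    def update(self, key, value):
        self.undo_log.append((key, self.db.get(key, _MISSING)))
        self.db[key] = value

    def abort(self):
        # Restore old values in reverse order, so the earliest value wins.
        for key, old in reversed(self.undo_log):
            if old is _MISSING:
                self.db.pop(key, None)
            else:
                self.db[key] = old
        self.undo_log.clear()

    def commit(self):
        self.undo_log.clear()          # the updates become permanent


db = {"a": 100, "b": 0}
t = Transaction(db)
t.update("a", 70)
t.update("b", 30)
t.abort()                              # a failure occurred mid-transaction
```

After the abort, `db` is back to its state before the transaction began, which is the atomicity property described above.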

Most of the above techniques focus on tolerating hardware failures. To deal with 
software failures, numerous consistency checks are included in every software 
module, and upon the detection of a problem, the processor is halted, resulting 
in the backup process being initiated. These consistency checks stop the process 
when a system data structure becomes contaminated, reducing considerably the 
chances of a database contamination. They also make system software errors very 
visible, allowing their correction, thus resulting in high-quality software. 

FIGURE 7.2 Modified NonStop system architecture.

7.1.4 Modifications to the NonStop Architecture 

Numerous modifications have been integrated into the hardware and software 
design of the NonStop systems as they evolved over time. We describe in what 
follows only the most significant ones. 

The original NonStop architecture relied heavily on custom-designed processors
with extensive use of self-checking techniques to allow processors to follow
the fail-fast design principle. With the rapid increase in the cost of designing and
fabricating custom processors, the original approach was no longer economically
viable, and the architecture was modified to use commercial microprocessors.
Such microprocessors do not support the level of self-checking that is required
for fail-fast operation, and consequently, the design was changed to a scheme
based on tight lock-stepping of pairs of microprocessors as shown in Figure 7.2.
A memory operation will not be executed unless the two separate requests are
identical; if they are not, the self-checked processor will stop executing its task.

Another significant modification to the architecture is the replacement of
the I/O channels and the interprocessor communication links (through the 
Dynabuses; see Figure 7.1) by a high-bandwidth, packet-switched network called 
ServerNet, shown in Figure 7.2. As the figure shows, this network is comprised 
of two independent fabrics so that a single failure can disrupt the operation of at 
most one fabric. Both fabrics are used by all the processors: each processor decides 
independently which fabric to use for a given message. 

The ServerNet provides not only high bandwidth and low latency but also better
support for detection and isolation of errors. Each packet transferred through
the network is protected with a Cyclic Redundancy Check (CRC; see Section 3.1). 
Every router that forwards the packet checks the CRC and appends either a "This 
packet is bad" or "This packet is good" flag to the packet, allowing easy isolation 
of link failures. 
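
The way per-hop flags localize a faulty link can be shown with a toy model. The packet layout, the function names, and the use of `zlib.crc32` are assumptions made for this sketch and do not describe the actual ServerNet packet format.

```python
import zlib


def make_packet(payload: bytes):
    """Build a packet carrying its CRC and an (initially empty) list of flags."""
    return {"payload": payload, "crc": zlib.crc32(payload), "flags": []}


def forward(packet, router_name, corrupt=False):
    """A router receives the packet, checks the CRC, and appends its verdict."""
    if corrupt:  # model a fault on the link leading into this router
        data = packet["payload"]
        packet["payload"] = bytes([data[0] ^ 0x01]) + data[1:]
    ok = zlib.crc32(packet["payload"]) == packet["crc"]
    packet["flags"].append((router_name, "good" if ok else "bad"))
    return packet


def first_bad_hop(packet):
    """The first router reporting 'bad' sits just after the failed link."""
    for name, verdict in packet["flags"]:
        if verdict == "bad":
            return name
    return None


pkt = make_packet(b"request")
forward(pkt, "R1")
forward(pkt, "R2", corrupt=True)       # the link into R2 corrupts the packet
forward(pkt, "R3")
print(first_bad_hop(pkt))              # prints R2
```

Since R1 saw a good packet and R2 a bad one, the fault is isolated to the link between them.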

Current trends in commercial microprocessors are such that achieving self-checking
through lock-step operation will no longer be viable: guaranteeing that
two microprocessors will execute a task in a fully synchronous manner is becoming
very difficult, if not impractical. The reasons for this include (1) the fact that
certain functional units within microprocessors use multiple clocks and asynchronous
interfaces; (2) the need to deal with soft errors (which become more likely
as VLSI feature sizes become smaller), which leads to low-level fix-up routines that may
be executed on one microprocessor and not the other; and (3) the use of variable
frequencies by power/temperature management techniques. Moreover, most future
high-end microprocessors will have multiple processor cores running multiple
tasks. A failure in one processor running a single task in a lock-stepped mode
will disrupt the operation of multiple processors, an undesirable event.

To address the above, the NonStop system architecture has been further modified,
moving from tight lock-step to loose lock-step operation. Instead of comparing
the outputs of the individual processors at every memory operation, only the
outputs of I/O operations are compared. As a result, variations due to soft-error
corrections, cache retries, and the like are more likely to be tolerated and not result
in mismatches. Furthermore, the modified NonStop architecture also allows triple
modular redundancy (TMR; see Chapter 2) configurations. The standard NonStop
configuration of dual redundancy can only detect errors, whereas the TMR
configuration allows uninterrupted operation even after a failure or a mismatch due
to asynchronous executions of the copies of the same task. An additional benefit
of the TMR configuration is that it is capable of protecting applications that do not
follow the recommended implementation as primary/backup process pairs.
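
The difference between detecting and masking can be made concrete with a small sketch. The function names and the representation of an I/O output as a byte string are assumptions for illustration, not NonStop interfaces.

```python
from collections import Counter

def duplex_check(out_a, out_b):
    """Dual redundancy: a mismatch is detected, but neither copy can be trusted."""
    return ("ok", out_a) if out_a == out_b else ("mismatch", None)

def tmr_vote(outputs):
    """TMR: the majority output is used, so one deviating copy is masked."""
    value, count = Counter(outputs).most_common(1)[0]
    return ("ok", value) if count >= 2 else ("no_majority", None)

# The comparison is applied only to I/O outputs (loose lock-step), so the
# copies are free to differ in internal timing between two I/O operations.
print(duplex_check(b"write blk 7", b"write blk 9"))
print(tmr_vote([b"write blk 7", b"write blk 9", b"write blk 7"]))
```

In the duplex case the operation must stop on a mismatch; in the TMR case the voted output is still available and execution can continue.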

7.2 Stratus Systems

The Stratus fault-tolerant system has quite a few similarities to the NonStop
system described above. Every unit in both systems is replicated (at least once) to
avoid single points of failure. This includes the processors, memory units, I/O
controllers, disk and communication controllers, buses, and power supplies. The main
difference between the two types of system is that the NonStop fault-tolerance
approach focuses mainly on the software, whereas the Stratus design achieves its
fault tolerance mainly through hardware redundancy. As a result, off-the-shelf
software need not be modified to consist of primary/backup process pairs before
running it on a Stratus server.

FIGURE 7.3 A single pair in a Stratus system.

Stratus systems use the pair-and-spare principle described in Section 2.3.6, in
which each pair consists of two processors operating in lock-step mode. The
architecture of a single pair is shown in Figure 7.3. Upon a mismatch between the two
CPUs, the pair will declare itself faulty and will no longer be involved in
producing results. The second pair will continue to execute the application.

As discussed in the previous section, modern off-the-shelf microprocessors
have asynchronous behavior. For this reason, enforcing a tight lock-step
operation that requires a match for every memory operation would drastically decrease
performance. Consequently, in more recent designs of Stratus servers (as shown
in Figure 7.3), only the I/O outputs from the motherboards are compared and a
mismatch will signal an error. A motherboard consists of a standard
microprocessor, a standard memory unit, and a custom unit that contains the I/O
interface and interrupt logic.

Similarly to NonStop systems, current Stratus systems can be configured to use 
TMR structures with voting to detect or mask failures. If such a TMR configuration 
suffers a processor or memory failure, it can be reconfigured to a duplex until the 
failed unit has been repaired or replaced. 

Unlike in NonStop systems, the memory unit is also duplicated, allowing the
contents of the main memory to be preserved through most system crashes. The I/O
and disks are duplicated as well, with redundant paths connecting individual I/O
controllers and disks to the processors. The disk systems use disk mirroring (see
Section 3.2). A disk utility checks for bad blocks on the disks and repairs them by
copying from the other disk.




The processors, memories, and I/O units have hardware error checking, and the
error signals that they generate are used by the system software, which includes
extensive detection and recovery capabilities for both transient and permanent
faults. Hardware components judged to have failed permanently are removed,
and the provided redundancy ensures that in most cases the system can continue
to function despite the removal of the failed component. A component that was
hit by a transient fault but has since recovered is restarted and rejoins the system.

Device drivers, which cause a significant fraction of operating system crashes, 
are hardened to reduce their failure rate. Such hardening takes the form of (a) 
reducing the chances that a device will malfunction, (b) promptly detecting the 
malfunctioning of a device, and (c) dealing with any such malfunctioning locally 
as much as possible to contain its effects and prevent it from propagating to the 
operating system. 

The probability of an I/O device malfunctioning can be reduced, for example, by
running sanity checks on the input, thus protecting the device from an obviously
bad input. Prompt detection can be carried out by using timeouts to detect device
hangs and by checking the value returned by the device for obvious errors. In some
cases, it may be possible, when the device is otherwise idle, to make it carry out
some test actions.
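
A minimal sketch of such a hardened call path is given below. It is not Stratus driver code: `hardened_call`, `DeviceError`, and the two validator callbacks are hypothetical names, and a worker thread stands in for the device operation.

```python
import concurrent.futures as cf

class DeviceError(Exception):
    """Raised locally so that a device problem is contained in the driver."""

def hardened_call(device_op, request, *, valid_request, valid_reply, timeout_s=0.5):
    # (a) Sanity-check the input before it reaches the device.
    if not valid_request(request):
        raise DeviceError("rejected obviously bad input")
    pool = cf.ThreadPoolExecutor(max_workers=1)
    try:
        # (b) A timeout turns a device hang into a promptly detected error.
        reply = pool.submit(device_op, request).result(timeout=timeout_s)
    except cf.TimeoutError:
        raise DeviceError("device hang (timeout)") from None
    finally:
        pool.shutdown(wait=False)
    # (b) Check the returned value for obvious errors.
    if not valid_reply(reply):
        raise DeviceError("implausible value returned by device")
    # (c) Only a checked reply is passed up to the caller.
    return reply
```

The caller sees either a validated reply or a single, well-defined `DeviceError`, which is one way of keeping the malfunction from propagating to the operating system.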

Upon a system crash, an automatic reboot is carried out. One of the CPUs is
kept offline in order to dump its memory to disk: such a dump can be analyzed to
diagnose the cause of the failure. Once this dump has been completed, the offline
CPU can be resynchronized with its functioning counterpart(s) and rejoin the
system. If the reboot is unsuccessful, the system is powered down and then powered
up again, followed by another reboot attempt.

Every fault detected by the system is reported to a remote Stratus support
center, allowing service engineers to continuously monitor the system and, if
necessary, troubleshoot and resolve problems online. If permanent faults are detected,
hot-swappable replacement parts are automatically ordered and shipped to the
customer.

7.3 Cassini Command and Data Subsystem

The Cassini spacecraft was designed to explore Saturn and its satellites. Launched
in 1997, it reached Saturn in 2004 and is scheduled to continue its mission through
2008. The activity level was relatively low until the spacecraft reached Saturn; since
then, it has launched the Huygens probe to study the satellite Titan, and has
carried out detailed studies of Saturn, its rings, and several of its satellites.

The spacecraft has three mission modes: normal, which takes up most of the
mission; mission-critical, which occurs during three critical stages of the mission:
launch, Saturn orbit insertion, and Titan probe relay; and safing, in which the
satellite has suffered a fault and has to be placed in a configuration that is safe and
appropriate for manual intervention from Earth.

The Command and Data Subsystem (CDS) issues commands to the other
subsystems and controls the buffering and formatting of data for sending back to
Earth. In particular, it has the following functions:

■ Communications. Management of commands from the ground and of telemetry
to send data from the spacecraft to Earth. Also, communication with the
spacecraft's engineering and science subsystems (such as the Attitude and
Articulation Control [AACS] and the Radio Frequency [RFS] subsystems).

■ Command Sequencing. Storing and playing out command sequences to manage
given activities such as launch and Saturn orbit insertion.

■ Time Keeping. Maintaining the spacecraft time reference, to coordinate
activity and facilitate synchronization.

■ Data Handling. Buffering data as needed if the data collection rate is greater
than the downlink transmission rate.

■ Temperature Control. Monitoring and managing spacecraft temperatures.

■ Fault Protection. Running algorithms which react to faults detected either
outside or in the CDS.

Because the spacecraft is meant to operate for about 11 years without any 
chance of hardware replacement or repair, the CDS must be fault tolerant. Such 
fault tolerance is provided by a dual-redundant system. 

Figure 7.4 provides a block diagram of the CDS. The heart of the CDS is a pair
of flight computers, each with very limited memory: 512 KWords of RAM and
8 KWords of PROM. For storage of data meant for transmission to Earth, there
are two solid-state recorders, each of 2 GBit capacity. Each flight computer is
connected to both recorders. Communication is by means of a dual-redundant 1553B
bus. The 1553B bus was introduced in the 1970s and consists of the cable (plus
couplers and connectors), a bus controller that manages transmissions on the bus
(all traffic on the bus either originates with the bus controller or is in response to a
bus controller command), and a remote terminal at each flight computer, to allow
it to communicate with the other computer. Sensors connected to the bus provide
the flight computers with state information, such as temperature, pressure, and
voltage levels. One flight computer is the primary at any given time; the other is a
backup. The bus controller of the backup computer is inhibited; that of the primary
is the only one that is active.

The CDS was designed under the assumption that the system will never have to 
cope with multiple faults at any given time. Apart from a specified set of failures, 
the system is supposed to be protected against any single failure. The exception set 
includes stuck bits in the interface circuitry that take the CDS to an uncommanded 
state, design faults, and the issuing of wrong commands from Earth. 

Errors are classified according to the location of the corresponding fault (central
vs. peripheral), their impact (noninterfering vs. interfering), and their duration
(transient vs. permanent). Central faults are those that occur in one of the flight
computers; faults occurring in other units, such as the solid-state recorders, bus,
or the sensor units, are classified as peripheral.

FIGURE 7.4 Cassini CDS block diagram.

Noninterfering faults are, as the term implies, faults that do not affect any
service that is necessary to the current mission phase. For some such faults, it is
sufficient to log them for future analysis; for others, some corrective action may need
to be taken. Interfering faults are those that affect a service that is important to the
current mission phase. Transient faults can be allowed to die away, and then the
system is restored to health; permanent faults require either automatic switching
to a redundant entity or placing the spacecraft in a safe mode and awaiting
instructions from ground control. As a general rule, if a fault can be handled by ground
control, it is so handled: the philosophy is to carry out autonomous recovery only
if ground-based intervention is not practical.
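
The classification and the response philosophy can be summarized as a small decision function. This is only a reading aid: the names are hypothetical, the order in which the tests are applied is an assumption, and the actual CDS fault-protection algorithms are far more detailed.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Action(Enum):
    LOG_FOR_ANALYSIS = auto()   # simplification: some such faults also need a fix
    DEFER_TO_GROUND = auto()
    WAIT_AND_RESTORE = auto()
    SWITCH_OR_SAFE_MODE = auto()

@dataclass(frozen=True)
class Fault:
    central: bool       # in a flight computer (otherwise peripheral)
    interfering: bool   # affects a service needed in the current mission phase
    transient: bool     # expected to die away (otherwise permanent)

def respond(fault, ground_can_handle):
    """Assumed ordering: impact first, then ground control, then duration."""
    if not fault.interfering:
        return Action.LOG_FOR_ANALYSIS
    if ground_can_handle:
        return Action.DEFER_TO_GROUND       # autonomy only when ground cannot act
    if fault.transient:
        return Action.WAIT_AND_RESTORE
    return Action.SWITCH_OR_SAFE_MODE

print(respond(Fault(central=True, interfering=True, transient=False), False))
```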

If the CDS itself fails for a substantial period of time, this is detected by the
AACS which places the spacecraft in a default "safe mode" to wait for the CDS
to recover. The AACS also has the ability to recognize some obviously unsafe
operating configurations, and can reject orders to configure the system in an unsafe
way.

7.4 IBM G5

The IBM G5 processor makes extensive use of fault-tolerance techniques to recover 
from transient faults that constitute the majority of hardware faults (see Chapter 
1). Fault tolerance is provided for the processor, memory, and I/O systems. In the 
processor and I/O systems, this takes the form of physical replication; in memory, 
extensive use is made of error detection and correction codes of the type described 
in Section 3.1. In addition, extensive hardware support is provided for rollback 
recovery from transient failures (see Chapter 6). 

Traditional redundancy methods are used to implement fault tolerance in the 
I/O subsystem. There are multiple paths from the processor to the I/O devices: 
these can be dynamically switched as necessary to route around faults. Inline error 
checking is provided, and the channel adapters are designed to prevent interface 
errors from propagating into the system. 

The G5 processor pipeline includes an I-unit, which is responsible for fetching
instructions, decoding them, generating any necessary addresses, and placing
pending instructions in an instruction queue. There is an E-unit, which executes
the instructions and updates the machine state. Both the I- and E-units are
duplicated: they work in lock-step, which allows the results of their activity to be
compared. A successful comparison indicates that all is well; a divergence between the
two instances indicates an error.

In addition, the processor has an R-unit, which consists of 128 32-bit and 128
64-bit registers. The R-unit is used to store the checkpointed machine state to
facilitate rollback recovery: this includes general-purpose, status word, and control
registers. The R-unit registers are protected by an error-correcting code (ECC), and
the R-unit is updated whenever the duplicate E-units generate identical results.

The processor has an ECC-protected store buffer, into which pending stores can 
be written. When a store instruction commits, the relevant store buffer entry can 
be written into cache. 

All writes to the L1 cache are also written through to the L2 cache; as a result,
there is always a backup copy of the L1 contents. The L2 cache and the main
memory, as well as the buses connecting the processor to the L2 cache and the L2 cache
to main memory, are protected using ECC (a (72,64) SEC/DED Hamming code;
see Section 3.1), whereas errors in L1 are detected using parity. When an L2 line
is detected as erroneous, it is invalidated in cache. If this line is dirty (i.e., was
modified since being brought in from main memory), the line is corrected if
possible and the updated line is stored in the main memory. If it is not possible to
correct the error, the line is invalidated in cache and steps are taken to prevent the
propagation of the erroneous data.

Special logic detects the same failures happening repeatedly in the same storage 
location in the L2 cache. Such repeated identical failures are taken to indicate a 
permanent fault; the affected cache line is then retired from use. 
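
The idea of inferring a permanent fault from repeated identical failures can be sketched as follows. The threshold of three, the use of an error syndrome to decide that two failures are identical, and all names are assumptions for illustration; the G5 logic is implemented in hardware.

```python
from collections import defaultdict

class RepeatFaultDetector:
    """Retire a cache line after `threshold` identical failures in a row."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.last_syndrome = {}
        self.repeats = defaultdict(int)
        self.retired = set()

    def report(self, line, syndrome):
        """Record an error at `line`; return True if the line is retired."""
        if self.last_syndrome.get(line) == syndrome:
            self.repeats[line] += 1
        else:
            # A different failure pattern looks transient: restart the count.
            self.last_syndrome[line] = syndrome
            self.repeats[line] = 1
        if self.repeats[line] >= self.threshold:
            self.retired.add(line)
        return line in self.retired

d = RepeatFaultDetector()
print([d.report(7, 0x21) for _ in range(3)])
```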

The data in the main memory are protected by the same (72,64) SEC/DED code,
and the address bus is protected using parity bits, one parity bit for every 24 bits.
Memory scrubbing is used to prevent transient memory errors from accumulating.
Memory scrubbing consists of regularly reading the memory, word by word, and
correcting any bit errors encountered. This way, memory errors are corrected
before they accumulate and their number exceeds the correction capabilities of the
SEC/DED code. Spare DRAM is also provided, which can be switched in to
replace a malfunctioning memory chip.
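
Why scrubbing helps can be seen in a scaled-down sketch. For brevity it uses an (8,4) extended Hamming code rather than the (72,64) code of the G5, but the SEC/DED behavior is the same: one error per word is corrected, two are only detected. All function names are illustrative.

```python
def encode(nibble):
    """(8,4) SEC/DED: c[1..7] is Hamming(7,4), c[0] is an overall parity bit."""
    c = [0] * 8
    c[3], c[5], c[6], c[7] = [(nibble >> i) & 1 for i in range(4)]
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) % 2
    return c

def decode(c):
    """Return (status, word); status is 'ok', 'corrected', or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos
    odd = sum(c) % 2                  # 1 if an odd number of bits flipped
    if syndrome == 0 and not odd:
        return "ok", c
    if odd:                           # single error; syndrome 0 means bit c[0]
        fixed = c[:]
        fixed[syndrome] ^= 1
        return "corrected", fixed
    return "uncorrectable", c         # double error: detected, not corrected

def scrub(memory):
    """One scrubbing pass: read every word and write back corrected words."""
    corrected = 0
    for addr, word in enumerate(memory):
        status, fixed = decode(word)
        if status == "corrected":
            memory[addr] = fixed
            corrected += 1
    return corrected

memory = [encode(0x3), encode(0xA)]
memory[0][5] ^= 1                     # first transient error
scrub(memory)                         # removed before a second one arrives
memory[0][6] ^= 1                     # second transient error, same word
scrub(memory)
print(memory[0] == encode(0x3))
```

Had the two bit flips accumulated in the same word between scrubbing passes, `decode` would have reported the word as uncorrectable.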

G5 systems have a variety of responses to errors. Localized data errors in the
registers or the L2 cache can be corrected by means of an ECC. Errors in the L1
cache are detected by means of parity and corrected by using the corresponding
copy in the L2 cache. If a processor operation results in an erroneous output
(detected by disagreeing outputs from the duplicated I- or E-units), the system retries
the instruction in the hope that the error was caused by a transient fault. Such a
retry is started by freezing the checkpointed state: updates to the R-unit are not
permitted. Pending write-throughs to the L2 cache from instructions that have
already been checkpointed are completed. The checkpointed state held in the
R-unit is loaded into the appropriate machine registers and the machine is restarted
from the checkpointed state. Note that this is not a system checkpointing process
(which, upon a failure, re-executes a large section of the application) of the type
that has been described in Chapter 6. Instead, it is a hardware-controlled process
for instruction retry and is transparent even to the operating system.

There may be instances in which recovery fails. For example, a permanent fault 
that results in repeated errors may occur. In such an event, the checkpoint data 
are transferred to a spare processor (if available) and execution continues on that 
processor. 
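
The retry mechanism can be modeled in software, with the caveat that in the G5 it is done entirely in hardware. In the sketch below, a dictionary plays the role of the R-unit, an instruction is a hypothetical (register, increment) pair, and the retry limit is an assumed parameter.

```python
class PermanentFault(Exception):
    """Carries the last good checkpoint so that it can be moved to a spare."""
    def __init__(self, checkpoint):
        super().__init__("retries exhausted")
        self.checkpoint = checkpoint

def execute(instr, regs):
    """Fault-free model of an execution unit: instr = (register, increment)."""
    reg, inc = instr
    regs[reg] = regs.get(reg, 0) + inc
    return regs

def run(program, unit_a, unit_b, max_retries=2):
    checkpoint = {}                          # plays the role of the R-unit
    for instr in program:
        for _ in range(max_retries + 1):
            out_a = unit_a(instr, dict(checkpoint))
            out_b = unit_b(instr, dict(checkpoint))
            if out_a == out_b:
                checkpoint = out_a           # commit only on agreement
                break
            # Mismatch: discard both results. The checkpoint is unchanged,
            # so the next pass retries the instruction from it.
        else:
            raise PermanentFault(checkpoint)
    return checkpoint

print(run([("r1", 5), ("r1", 2), ("r2", 7)], execute, execute))
```

A transient fault in one unit costs only a retry of the affected instruction; a fault that persists through all retries surfaces as `PermanentFault`, whose checkpoint is what would be handed to a spare processor.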

Unless the system runs out of spares to deal with permanent failures or the
checkpointed data are found to have been corrupted, a failure and the subsequent
recovery will be transparent to the operating system and the application: the
recovery process is generally handled rapidly in hardware.


7.5 IBM Sysplex 

The IBM Sysplex is a multinode system that offers some fault-tolerance
protection for enterprise applications. The system is configured as shown in Figure 7.5.
A number of computing nodes (up to 32) are interconnected; each node is either a
single- or multiple-processor entity. The system includes a global timer, which
provides a common time reference to unambiguously order the events across nodes.
A storage director connects this cluster of processors to shared storage, in the form
of multiple disk systems. This storage is equally shared: every node has access to
any part of it. Connection between the computing nodes and the storage devices is
made fault tolerant through redundant connections. The storage itself can be made
sufficiently reliable through coding or replication. The existence of truly shared
disk storage makes it possible for applications running on one node to be easily
restarted on another.

FIGURE 7.5 IBM Sysplex configuration.

Processes indicate, through a registration service, whether a restart may be
required. When a process is completed, it deregisters itself, to indicate that it will no
longer require restart.

When the system detects a node failure, it must (i) try to restart that node and
(ii) restart the applications that were running on that node. Failure detection is
through a heartbeat mechanism: the nodes periodically emit heartbeats or "I am
alive" messages. If a sufficiently long sequence of heartbeat messages is missed
from a node, it is declared to have failed. False alarms can arise because it is
possible under some circumstances for functional nodes to miss sending out heartbeats
at the right time. The heartbeat mechanism must therefore be carefully tuned to
balance the need to catch failures against the need to keep the false alarm rate
sufficiently low.
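
A minimal sketch of a heartbeat monitor shows where the tuning knob sits. The class, its parameters, and the use of explicit timestamps are illustrative assumptions and do not reflect the Sysplex implementation.

```python
class HeartbeatMonitor:
    """Declare a node failed after `max_missed` consecutive missed heartbeats."""

    def __init__(self, period_s, max_missed):
        self.period_s = period_s
        self.max_missed = max_missed
        self.last_seen = {}

    def heartbeat(self, node, now):
        """Record an 'I am alive' message received from `node` at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now):
        limit = self.period_s * self.max_missed
        return sorted(n for n, t in self.last_seen.items() if now - t > limit)

strict, lenient = HeartbeatMonitor(1.0, 3), HeartbeatMonitor(1.0, 5)
for monitor in (strict, lenient):
    monitor.heartbeat("A", 3.0)
    monitor.heartbeat("B", 0.0)      # B has been silent since t = 0
print(strict.failed_nodes(3.5), lenient.failed_nodes(3.5))
```

A small `max_missed` catches a dead node sooner but also flags a node that is merely late; a large value does the opposite.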

When a node failure is detected, the Automatic Restart Manager (ARM) takes
charge of restarting the affected tasks. The ARM has access to the global system
state: it is aware of the loading of each node and can carry out load balancing while
in the process of migrating affected tasks to other nodes. The ARM is also aware
of task affinity groups, which are tasks that must be assigned together to the same
node (e.g., because they have a heavy amount of intercommunication), and of any
sequencing constraints (e.g., that task P should be restarted only after task Q has
done so). Also provided are the maximum number of restart attempts, both on the
original node and on other nodes, as well as the amount of memory required.

When restarting tasks on other nodes, care has to be taken that the supposedly
failed node is really down. This is necessary to avoid the possibility of two copies
of the same task (the original and restarted versions) both being active. Such
duplicates may be no more than a harmless waste of computational resources in some
applications; in other cases, however, duplication may result in erroneous results
(e.g., incorrect updates may occur in databases). Similarly, care must be taken
when a node's access to the global shared state is lost, to ensure that erroneous
events do not occur. For example, if node x loses access to the global state and
decides to recover application a, it may well be that some other node y is restarting
a as well, thus resulting in two copies of a. Sysplex deals with such problems by
disallowing restarts on nodes which have lost access to the global state. To
implement such a policy, an array of system sequence numbers, SysSeqNum(), is used.
The system sequence number associated with a node is incremented every time
access to global shared state is lost and then re-established. Every process, P, on a
given node x is labeled with the value of SysSeqNum(x) at the time it registers for
the restart service (notifies the system that it should be restarted if there is a
failure). Should access to the shared state now be lost and then be restored, process
P's sequence number will no longer equal the latest value of SysSeqNum(x). P will
now be de-registered from the recovery service.
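
The sequence-number bookkeeping can be sketched in a few lines. Only the name SysSeqNum comes from the text; the class, its methods, and the eager de-registration at the moment access is restored are assumptions made to keep the example short.

```python
class RestartService:
    def __init__(self, nodes):
        self.sys_seq_num = {n: 0 for n in nodes}   # SysSeqNum(), one per node
        self.registered = {}                       # process -> (node, label)

    def register(self, process, node):
        """Label the process with the node's current sequence number."""
        self.registered[process] = (node, self.sys_seq_num[node])

    def deregister(self, process):
        self.registered.pop(process, None)

    def access_restored(self, node):
        """The node lost and then re-established access to the shared state."""
        self.sys_seq_num[node] += 1
        stale = [p for p, (n, label) in self.registered.items()
                 if n == node and label != self.sys_seq_num[node]]
        for p in stale:
            del self.registered[p]                 # no longer eligible for restart
        return stale

    def may_restart(self, process):
        return process in self.registered

svc = RestartService(["x", "y"])
svc.register("P", "x")
svc.register("Q", "y")
print(svc.access_restored("x"), svc.may_restart("P"), svc.may_restart("Q"))
```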

ARM also provides support for hot-standby mode. In such a mode, there are 
primary and secondary servers for a given application: if the primary fails, the 
output of the secondary can be used. The switchover from primary to secondary 
is much faster than when hot-standby is not used. 


7.6 Itanium

The Intel Itanium processor is a 64-bit design, meant for use in high-end server 
and similar applications. It is an Explicitly Parallel Instruction Computer (EPIC) 
capable of executing up to six instructions per cycle, which are bundled by the 
compiler so that data dependencies are avoided. It has several built-in features for 
fault tolerance to enhance availability. 

The Itanium makes extensive use of parity and error-correcting coding in its
data buses (where a single-bit error can be corrected), and in its three levels of
cache. There are separate data (L1D) and instruction (L1I) caches at the L1 level
while L2 and L3 are unified caches.





L1I and L1D (both the tag and data arrays) are protected by error-detecting
parity. When an error is detected, the entire cache is invalidated. L1D has
byte-wise parity to facilitate load/store operations of granularity finer than a word.
Since faults tend to be spatially correlated (meaning that if a particular location is
suffering a transient fault, it is more likely that physically neighboring locations
will be affected as well), bits from adjacent cache lines are physically interleaved
on silicon. This reduces the probability of a (potentially undetectable) multiple-bit
error in a given cache line.
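
The benefit of interleaving can be demonstrated with a toy model. The four 9-bit lines (8 data bits plus one even-parity bit) and the particular interleaving pattern are assumptions for illustration; the actual Itanium layout is not described here.

```python
def with_parity(bits):
    """Append one even-parity bit to a list of data bits."""
    return bits + [sum(bits) % 2]

def parity_ok(line):
    return sum(line) % 2 == 0

def interleave(lines):
    """Physical cell k holds bit k // n of line k % n (n = number of lines)."""
    n = len(lines)
    return [lines[k % n][k // n] for k in range(n * len(lines[0]))]

def deinterleave(cells, n):
    return [cells[i::n] for i in range(n)]

def burst(cells, start, length):
    """Flip `length` physically adjacent cells, starting at `start`."""
    return [b ^ 1 if start <= k < start + length else b
            for k, b in enumerate(cells)]

lines = [with_parity([1, 0, 1, 1, 0, 0, 1, 0]) for _ in range(4)]

# Without interleaving, both flipped bits land in line 0: parity still passes.
flat = burst([b for line in lines for b in line], 5, 2)
plain = [parity_ok(flat[i * 9:(i + 1) * 9]) for i in range(4)]

# With interleaving, the same burst touches two different lines, one bit each.
spread = [parity_ok(l) for l in deinterleave(burst(interleave(lines), 5, 2), 4)]
print(plain, spread)
```

In the first layout the double-bit error goes undetected; in the second, each affected line sees a single-bit error that parity catches.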

The L2 cache has its data array protected by error-correcting codes (a (72,64) 
SEC/DED Hamming code) and its tag array by error-detecting parity (one parity 
bit for no more than 24 bits). Errors correctable by coding are usually automatically 
corrected; other (more wide-ranging) responses are outlined below. 

Both the tag and data arrays of L3 are protected by similar error-correcting 
codes. Single-bit data errors are silently corrected when the data are written back. 
Upon a tag array error, all four ways of the relevant entry in the tag array are 
scrubbed. 

When an error in any level of the cache is detected, the system corrects it if 
possible, sends out a "corrected machine check interrupt" to indicate that such a 
correction has occurred, and resumes its normal operation. An exception to this is 
when an error is "promoted," as is described later. 

Suppose the error is not hardware-correctable. If it requires hardware error
containment to prevent it from spreading, a bus reset is carried out. A bus reset clears
all pending memory and bus transactions and all the internal state machines. All
architectural state is preserved (meaning that the register files, caches and TLBs
[Translation Lookaside Buffers] are not cleared).

If hardware error containment is not required, a Machine Check Abort (MCA) 
is signaled. An MCA may be either local or global. If local, it is restricted to the 
processor or thread encountering the error: information about this is not sent out 
to any other processors in the system. In contrast, all the processors will be notified 
of a global MCA. 

Error handling is done layer by layer. We have already seen that the hardware
will correct such errors as it can. Above the hardware layer are the Processor
Abstraction (PAL) and the System Abstraction (SAL) layers, whose job it is to hide
lower-level implementation details concerning, respectively, the processor and the
system external to the processor (such as the memory or the chipset) from
higher-level entities (such as the operating system). Error handling is attempted by these
layers in turn. If either layer can successfully handle the error, error handling can
end there once information about the error has been sent to the operating system.
If neither of these abstraction layers can deal with the error, the operating system
gets into the act. For example, if an individual process is identified as the error
source, the operating system can abort it.

There are instances in which the error is impossible to successfully handle at
any level. In such an instance, a reboot and I/O reinitialization may be necessary.
Such a reboot may be local to an individual processor or involve the entire system,
depending on the nature of the error.
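
The layer-by-layer escalation can be pictured as a short chain of handlers. The sketch below is schematic only: the handler callbacks, the error strings, and the log standing in for the notification of the operating system are hypothetical, and the real PAL and SAL interfaces are firmware procedures, not Python functions.

```python
def handle_error(error, pal, sal, operating_system, os_log):
    """Try PAL, then SAL, then the OS; fall back to a reboot."""
    for name, handler in (("PAL", pal), ("SAL", sal)):
        if handler(error):
            os_log.append((error, name))   # the OS is still informed
            return name
    if operating_system(error):            # e.g., abort the offending process
        return "OS"
    return "reboot"                        # no layer could handle the error

log = []
pal = lambda e: e == "tlb error"
sal = lambda e: e == "chipset error"
operating_system = lambda e: e == "bad data in one process"
for e in ("chipset error", "bad data in one process", "unknown"):
    print(e, "->", handle_error(e, pal, sal, operating_system, log))
```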

In some cases, an error may be "promoted," and a higher-level response than is
strictly necessary may be employed. For example, suppose the processor is being
used in a duplex or some other redundant architecture in which multiple
processors are executing the same code, off identical inputs, and to the beat of a
synchronized clock. The cycle-by-cycle output of the redundant processors can then
be compared in order to detect faults. In such a setup, taking a processor out of
lock-step to carry out a hardware error correction may not be the most
appropriate thing to do: instead, it may be best to signal a global MCA and let some
higher-scope entity handle the problem.

When erroneous data are detected (but not corrected), the usual response is
to reboot the entire system (or at least the affected node if the system has
multiple processors). The Itanium offers a more focused approach. Erroneous data
are marked as such (something that is called data poisoning), and any process that
tries to use such data is aborted. The effect of erroneous data is therefore less
pronounced, especially if used by only a small number of processes. Data poisoning is
carried out at the L2 cache level, and the rules for implementing it are as follows:

■ Any store to a poisoned cache line is ignored. 

■ If a poisoned line is removed from the cache (to make room for a new line), 
it is written back to main memory and a flag is raised at that location, to 
indicate that the contents are poisoned. 

■ Any process that attempts to fetch a poisoned line triggers an MCA. 
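
The three rules above can be exercised in a small software model. It is a sketch under simplifying assumptions: cache lines are dictionary entries, the memory flag is a boolean stored next to the data, and an exception stands in for the MCA. None of the names correspond to Itanium interfaces.

```python
class MachineCheckAbort(Exception):
    """Stands in for the MCA signaled on use of poisoned data."""

class PoisoningCache:
    def __init__(self):
        self.cache = {}    # address -> [data, poisoned]
        self.memory = {}   # address -> (data, poisoned flag)

    def load_line(self, addr, data):
        self.cache[addr] = [data, False]

    def mark_poisoned(self, addr):
        """Called when an uncorrectable error is detected in the line."""
        self.cache[addr][1] = True

    def store(self, addr, data):
        line = self.cache[addr]
        if line[1]:
            return                               # rule 1: store is ignored
        line[0] = data

    def evict(self, addr):
        data, poisoned = self.cache.pop(addr)
        self.memory[addr] = (data, poisoned)     # rule 2: flag travels to memory

    def fetch(self, addr):
        if addr in self.cache:
            data, poisoned = self.cache[addr]
        else:
            data, poisoned = self.memory[addr]
        if poisoned:
            raise MachineCheckAbort(hex(addr))   # rule 3: the consumer is stopped
        return data
```

Only a process that actually fetches the poisoned line is affected; processes touching other lines continue undisturbed.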

As mentioned before, once an error has been detected, information about it is
passed on to the operating system. This can be done through an interrupt.
Alternatively, the operating system may choose to mask out such interrupts and,
from time to time, poll lower layers for this information. Such information can be
used to better manage the system. For example, if a particular page frame in main
memory is observed to suffer from a high error rate, the operating system could
decide to stop mapping anything into it.

Due to the extensive set of fault-tolerance mechanisms implemented in the
Itanium (compared to most other commercial microprocessors), it has been selected
as a building block in several fault-tolerant multiprocessors, including the most
recent designs of the NonStop systems.

7.7 Further Reading 

Most books on fault tolerance include descriptions of existing fault-tolerant
systems, for example [12,18,20]. Further details on the original Tandem systems can
be found in [2,14,24]. The more recent design of the NonStop system is described in
[4]. Self-checking logic which is used in the design of some nonstop processors is
described in [13]. Design of self-checking checkers is presented in [1]. The shifted
operands technique for detecting errors in arithmetic units appears in [17,22].

The Stratus systems are described in white papers published by Stratus 
Technologies and available at www.stratus.com/whitep/index.htm. Hardening 
drivers to make them more resilient is discussed in [8]. 

The Cassini spacecraft CDS is described in [7]; information about the Cassini 
AACS can be found in [6]. 

The main source for the IBM G5 processor is the 1999 September/November
special issue of the IBM Journal of Research and Development. An overview of the
fault-tolerance techniques used in G5 is provided in [23]. Another good
introduction can be found in [21]. The G5 cache and the I/O system are described in [25]
and [9], respectively.

The main reference for the IBM S/390 Sysplex is the Volume 36, No. 2, issue of
the IBM Systems Journal; for an overview, see [16], and for a description of
high-availability, see [5]. A very informative comparison of the IBM and HP/Tandem
NonStop designs is included in [3].

Information about the Intel Itanium processor is widely available. Excellent
introductions can be found in the September/October 2000 issue of IEEE Micro,
which contains several relevant papers, and in [15]. Another good source is the
Intel Corporation website, especially [10,11]. The Itanium has been used in
several designs of fault-tolerant systems including IBM, NEC, Fujitsu and
Hewlett-Packard's NonStop [4,19].

Defect Tolerance in VLSI Circuits


With the continuing increase in the total number of devices in VLSI circuits (e.g.,
microprocessors) and in the density of these devices (due to the reduction in their
size) has come an increasing need for defect tolerance. Some of the millions of
submicron devices that are included in a VLSI chip are bound to have imperfections
resulting in yield-reducing manufacturing defects, where yield is defined as the
percentage of operational chips out of the total number fabricated.

Consequently, increasing attention is being paid to the development and use of
defect-tolerance techniques for yield enhancement, to complement existing efforts
at the manufacturing stage. Design-stage yield enhancement techniques are aimed
at making the integrated circuit defect tolerant, or less sensitive to manufacturing
defects, and include incorporating redundancy into the design, modifying the
circuit floorplan, and modifying its layout. We concentrate in this chapter on the first
two, which are directly related to the focus of this book.

Adding redundant components to the circuit can help in tolerating
manufacturing defects and thus increase the yield. However, too much redundancy may
reduce the yield since a larger-area circuit is expected to have a larger number of
defects. Moreover, the increased area of the individual chip will result in a
reduction in the number of chips that can fit in a fixed-area wafer. Successful designs of
defect-tolerant chips must therefore rely on accurate yield projections to determine
the optimal amount of redundancy to be added. We discuss several statistical yield
prediction models and their application to defect-tolerant designs. Then, various
yield enhancement techniques are described and their use illustrated.

8.1 Manufacturing Defects and Circuit Faults 

Manufacturing defects can be roughly classified into global defects (or gross area
defects) and spot defects. Global defects are relatively large-scale defects, such as
scratches from wafer mishandling, large-area defects from mask misalignment,
and over- and under-etching. Spot defects are random local and small defects from
materials used in the process and from environmental causes, mostly the result
of undesired chemical and airborne particles deposited on the chip during the
various steps of the process.

Both defect types contribute to yield loss. In mature, well-controlled fabrication
lines, gross-area defects can be minimized and almost eliminated. Controlling
random spot defects is considerably more difficult, and the yield loss due to spot
defects is typically much greater than the yield loss due to global defects. This is
especially true for large-area integrated circuits, since the frequency of global defects
is almost independent of the die size, whereas the expected number of spot
defects increases with the chip area. Consequently, spot defects are of greater
significance when yield projection and enhancement are concerned and are therefore
the focus of this chapter.

Spot defects can be divided into several types according to their location and to
the potential harm they may cause. Some cause missing patterns which may result
in open circuits, whereas others cause extra patterns that may result in short circuits.
These defects can be further classified into intralayer and interlayer defects.
Intralayer defects occur as a result of particles deposited during the lithographic
processes and are also known as photolithographic defects. Examples of these are
missing metal, diffusion or polysilicon, and extra metal, diffusion or polysilicon.
Also included are defects in the silicon substrate, such as contamination in the
deposition processes. Interlayer defects include missing material in the vias between
two metal layers or between a metal layer and polysilicon, and extra material
between the substrate and metal (or diffusion or polysilicon) or between two separate
metal layers. These interlayer defects occur as a result of local contamination,
because of, for example, dust particles.

Not all spot defects result in structural faults such as line breaks or short circuits. 
Whether or not a defect will cause a fault depends on its location, size, and the 
layout and density of the circuit (see Figure 8.1). For a defect to cause a fault, 
it has to be large enough to connect two disjoint conductors or to disconnect a 
continuous pattern. Out of the three circular missing-material defects appearing in 
the layout of metal conductors in Figure 8.1, the two top ones will not disconnect 
any conductor, whereas the bottom defect will result in an open-circuit fault. 

We make, therefore, the distinction between physical defects and circuit faults. 
A defect is any imperfection on the wafer, but only those defects that actually 
affect the circuit operation are called faults: these are the only ones causing yield 
losses. Thus, for the purpose of yield estimation, the distribution of faults, rather 
than that of defects, is of interest. 

Some random defects that do not cause structural faults (also termed functional
faults) may still result in parametric faults; that is, the electrical parameters of some
devices may be outside their desired range, affecting the performance of the circuit.
For example, although a missing-material photolithographic defect may be
too small to disconnect a transistor, it may still affect its performance. Parametric
faults may also be the result of global defects that cause variations in process
parameters. This chapter does not deal with parametric faults and concentrates instead
on functional faults, against which fault-tolerance techniques can be used.

FIGURE 8.1 The critical area for missing-metal defects of diameter x. I. Koren and Z. Koren,
"Defect Tolerant VLSI Circuits: Techniques and Yield Analysis," Proceedings of the IEEE © 1998
IEEE.


8.2 Probability of Failure and Critical Area


We next describe how the fraction of manufacturing defects that result in functional
faults can be calculated. This fraction, also called the probability of failure
(POF), depends on a number of factors: the type of the defect, its size (the greater
the defect size, the greater the probability that it will cause a fault), its location,
and circuit geometry. A commonly adopted simplifying assumption is that a defect
is circular with a random diameter $x$ (as shown in Figure 8.1). Accordingly, we
denote by $\theta_i(x)$ the probability that a defect of type $i$ and diameter $x$ will cause a
fault, and by $\theta_i$ the average POF for type $i$ defects. Once $\theta_i(x)$ is calculated, $\theta_i$ can
be obtained by averaging over all defect diameters $x$. Experimental data lead to
the conclusion that the diameter $x$ of a defect has a probability density function,
$f_d(x)$, given by


$$
f_d(x) =
\begin{cases}
k x^{-p} & \text{if } x_0 \le x \le x_M \\
0 & \text{otherwise}
\end{cases}
\tag{8.1}
$$


where $k = (p-1)\,x_0^{p-1} x_M^{p-1} / (x_M^{p-1} - x_0^{p-1})$ is a normalizing constant, $x_0$ is the
resolution limit of the lithography process, and $x_M$ is the maximum defect size. The
values of $p$ and $x_M$ can be determined empirically and may depend on the defect
type. Typically, $p$ ranges in value between 2 and 3.5. $\theta_i$ can now be calculated as

$$
\theta_i = \int_{x_0}^{x_M} \theta_i(x)\, f_d(x)\, dx
\tag{8.2}
$$


Analogously, we define the critical area, $A_i^{(c)}(x)$, for defects of type $i$ and diameter $x$
as the area in which the center of a defect of type $i$ and diameter $x$ must fall in order
to cause a circuit failure, and denote by $A_i^{(c)}$ the average over all defect diameters $x$ of these
areas. $A_i^{(c)}$ is called the critical area for defects of type $i$, and can be calculated as

$$
A_i^{(c)} = \int_{x_0}^{x_M} A_i^{(c)}(x)\, f_d(x)\, dx
\tag{8.3}
$$


Assuming that given a defect, its center is uniformly distributed over the chip
area, and denoting the chip area by $A_{\mathrm{chip}}$, we obtain

$$
\theta_i(x) = \frac{A_i^{(c)}(x)}{A_{\mathrm{chip}}}
\tag{8.4}
$$

and consequently, from Equations 8.2 and 8.3,

$$
\theta_i = \frac{A_i^{(c)}}{A_{\mathrm{chip}}}
\tag{8.5}
$$


Since the POF and the critical area are related through Equation 8.5, either one of
them can be calculated first. There are several methods of calculating these parameters.
Some methods are geometry based and calculate $A_i^{(c)}(x)$ first; others
are Monte Carlo-type methods, in which $\theta_i(x)$ is calculated first.

We illustrate the geometrical method for calculating critical areas through the
VLSI layout in Figure 8.1, which shows two horizontal conductors. The critical
area for a missing-material defect of size $x$ in a conductor of length $L$ and width $w$
is the size of the shaded area in Figure 8.1, given by

$$
A_{\mathrm{miss}}^{(c)}(x) =
\begin{cases}
0 & \text{if } x < w \\
(x-w)L + \frac{1}{2}(x-w)\sqrt{x^2-w^2} & \text{if } x \ge w
\end{cases}
\tag{8.6}
$$


The critical area is a quadratic function of the defect diameter, but for $L \gg w$, the
quadratic term becomes negligible. Thus, for long conductors we can use just the
linear term. An analogous expression for $A_{\mathrm{extra}}^{(c)}(x)$ for extra-material defects in a
rectangular area of width $s$ between two adjacent conductors can be obtained by
replacing $w$ by $s$ in Equation 8.6.
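The averaging in Equations 8.1 through 8.5 and the conductor geometry of Equation 8.6 can be checked numerically. The following Python sketch is only an illustration: the function names, the Simpson-rule integrator, and all numerical values (p = 3, x_0 = 0.1, x_M = 10, a conductor with L = 100 and w = 1, and a chip area of 10^4, in arbitrary length units) are assumptions chosen for the example and do not come from the text.

```python
import math

def defect_size_pdf(x, p, x0, xM):
    """Equation 8.1: f_d(x) = k * x**(-p) for x0 <= x <= xM, else 0."""
    if x < x0 or x > xM:
        return 0.0
    k = (p - 1) * x0**(p - 1) * xM**(p - 1) / (xM**(p - 1) - x0**(p - 1))
    return k * x**(-p)

def critical_area_missing(x, length, width):
    """Equation 8.6: critical area of one conductor for a missing-material defect."""
    if x < width:
        return 0.0
    return (x - width) * length + 0.5 * (x - width) * math.sqrt(x * x - width * width)

def simpson(f, a, b, n=20000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for j in range(1, n):
        total += (4 if j % 2 else 2) * f(a + j * h)
    return total * h / 3.0

def average_critical_area(length, width, p, x0, xM):
    """Equation 8.3: critical area averaged over all defect diameters."""
    return simpson(
        lambda x: critical_area_missing(x, length, width) * defect_size_pdf(x, p, x0, xM),
        x0, xM)

# Assumed illustrative values (arbitrary length units), not taken from the text.
P, X0, XM = 3.0, 0.1, 10.0
LENGTH, WIDTH, CHIP_AREA = 100.0, 1.0, 1.0e4

a_c = average_critical_area(LENGTH, WIDTH, P, X0, XM)
theta = a_c / CHIP_AREA  # Equation 8.5
print(f"average critical area = {a_c:.4f}, POF = {theta:.3e}")
```

Because the density of Equation 8.1 falls off as $x^{-p}$, most defects are smaller than the conductor width and contribute nothing, which is why the averaged critical area in this example is far smaller than the conductor area $Lw$.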

Other regular shapes can be similarly analyzed, and expressions for their critical
area can be derived. Common VLSI layouts consist of many shapes in different
sizes and orientations, and it is very difficult to derive the exact expression for the
critical area of all but very simple and regular layouts. Therefore, other techniques
have been developed, including several more efficient geometrical methods and
Monte Carlo simulation methods. One geometrical method is the polygon expansion
technique, in which adjacent polygons are expanded by $x/2$ and the intersection
of the expanded polygons is the critical area for short-circuit faults of diameter $x$.

In the Monte Carlo approach, simulated circles representing defects of different
sizes are "placed" at random locations of the layout. For each such "defect," the
circuit of the "defective" IC is extracted and compared with the defect-free circuit
to determine whether the defect has resulted in a circuit fault. The POF, $\theta_i(x)$, is
calculated for defects of type $i$ and diameter $x$ as the fraction of such defects that
would have resulted in a fault. It is then averaged using Equation 8.2 to produce
$\theta_i$, and $A_i^{(c)} = \theta_i A_{\mathrm{chip}}$. An added benefit of the Monte Carlo method is that the
circuit fault resulting from a given defect is precisely identified. However, the Monte
Carlo approach has traditionally been very time consuming. Only recently have
more efficient implementations been developed, allowing this method to be used
for large ICs.

Once $A_i^{(c)}$ (or $\theta_i$) has been calculated for every defect type $i$, it can be used as
follows. Let $d_i$ denote the average number of defects of type $i$ per unit area; then
the average number of manufacturing defects of type $i$ on the chip is $A_{\mathrm{chip}} d_i$. The
average number on the chip of circuit faults of type $i$ can now be expressed as
$\theta_i A_{\mathrm{chip}} d_i = A_i^{(c)} d_i$.

In the rest of this chapter, we will assume that the defect densities are given and
the critical areas are calculated. Thus, the average number of faults on the chip,
denoted by $\lambda$, can be obtained using

$$
\lambda = \sum_i A_i^{(c)} d_i = \sum_i \theta_i A_{\mathrm{chip}} d_i
\tag{8.7}
$$

where the sum is taken over all possible defect types on the chip.


8.3 Basic Yield Models

To project the yield of a given chip design, we can construct an analytical probability
model that describes the expected spatial distribution of manufacturing defects
and, consequently, of the resulting circuit faults that eventually cause yield loss.
The amount of detail needed regarding this distribution differs between chips that
have some incorporated defect tolerance and those which do not. In case of a chip
with no defect tolerance, its projected yield is equal to the probability of no faults
occurring anywhere on the chip. Denoting by $X$ the number of faults on the chip,
the chip yield, denoted by $Y_{\mathrm{chip}}$, is given by

$$
Y_{\mathrm{chip}} = \mathrm{Prob}\{X = 0\}
\tag{8.8}
$$

If the chip has some redundant components, projecting its yield requires a more
intricate model that provides information regarding the distribution of faults over
partial areas of the chip, as well as possible correlations among faults occurring
in different subareas. In this section we describe statistical yield models for chips
without redundancy; in Section 8.4, we generalize these models for predicting the
effects of redundancy on the yield.

8.3.1 The Poisson and Compound Poisson 
Yield Models 

The most common statistical yield models appearing in the literature are the Poisson
model and its derivative, the Compound Poisson model. Although other models
have been suggested, we will concentrate here on this family of distributions,
due to the ease of calculation when using them and the documented good fit of
these distributions to empirical yield data.

Let $\lambda$ denote the average number of faults occurring on the chip; in other words,
the expected value of the random variable $X$. Assuming that the chip area is divided
into a very large number, $n$, of small statistically independent subareas, each
with a probability $\lambda/n$ of having a fault in it, we get the following Binomial
probability for the number of faults on the chip:

$$
\mathrm{Prob}\{X = k\} = \mathrm{Prob}\{k \text{ faults occur on chip}\}
= \binom{n}{k}\left(\frac{\lambda}{n}\right)^{k}\left(1-\frac{\lambda}{n}\right)^{n-k}
\tag{8.9}
$$

Letting $n \to \infty$ in Equation 8.9 results in the Poisson distribution

$$
\mathrm{Prob}\{X = k\} = \mathrm{Prob}\{k \text{ faults occur on chip}\} = \frac{e^{-\lambda}\lambda^{k}}{k!}
\tag{8.10}
$$

and the chip yield is equal to

$$
Y_{\mathrm{chip}} = \mathrm{Prob}\{X = 0\} = e^{-\lambda}
\tag{8.11}
$$

Note that we use here the spatial (area dependent) Poisson distribution rather than
the Poisson process in time discussed in Chapter 2.
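As a small numerical illustration of Equations 8.7 and 8.9 through 8.11, the sketch below computes the fault average from a list of critical areas and defect densities and shows the Binomial zero-fault probability approaching the Poisson yield as the number of subareas grows. The function names and the two defect types with their critical areas and densities are hypothetical values picked for the example.

```python
import math

def average_faults(critical_areas, defect_densities):
    """Equation 8.7: lambda = sum over defect types of A_i^(c) * d_i."""
    return sum(a * d for a, d in zip(critical_areas, defect_densities))

def binomial_pmf(k, n, lam):
    """Equation 8.9: n independent subareas, each faulty with probability lam/n."""
    q = lam / n
    return math.comb(n, k) * q**k * (1.0 - q)**(n - k)

def poisson_pmf(k, lam):
    """Equation 8.10."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def poisson_yield(lam):
    """Equation 8.11: probability of zero faults on the chip."""
    return math.exp(-lam)

# Hypothetical critical areas (cm^2) and defect densities (defects per cm^2).
lam = average_faults([0.30, 0.15], [1.0, 2.0])

for n in (10, 100, 10000):
    # The Binomial zero-fault probability tends to exp(-lam) as n grows.
    print(n, binomial_pmf(0, n, lam), poisson_yield(lam))
```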

It has been known since the beginning of integrated circuit manufacturing that 
Equation 8.11 is too pessimistic and leads to predicted chip yields that are too 
low when extrapolated from the yield of smaller chips or single circuits. It later 
became clear that the lower-predicted yield was caused by the fact that defects, 
and consequently faults, do not occur independently in the different regions of the 
chip but rather tend to cluster more than is predicted by the Poisson distribution. 
Figure 8.2 demonstrates how increased clustering of faults can increase the yield.
The same number of faults occur in both wafers, but the wafer on the right has a
higher yield due to the tighter clustering.

FIGURE 8.2 Effect of clustering on chip yield: (a) non-clustered faults, $Y_{\mathrm{chip}} = 0.5$;
(b) clustered faults, $Y_{\mathrm{chip}} = 0.7$. I. Koren and Z. Koren, "Defect Tolerant VLSI
Circuits: Techniques and Yield Analysis," Proceedings of the IEEE © 1998 IEEE.

Clustering of faults implies that the assumption that subareas on the chip are
statistically independent, which led to Equation 8.9 and consequently to Equations
8.10 and 8.11, is an oversimplification. Several modifications to Equation 8.10
have been proposed to account for fault clustering. The most commonly used
modification is obtained by considering the parameter $\lambda$ in Equation 8.10 as a random
variable rather than a constant. The resulting Compound Poisson distribution
produces a distribution of faults in which the different subareas on the chip are
correlated and which has a more pronounced clustering than that generated by
the pure Poisson distribution.

Let us now demonstrate this compounding procedure. Let $\lambda$ be the expected
value of a random variable $L$ with values $\ell$ and a density function $f_L(\ell)$, where
$f_L(\ell)\,d\ell$ denotes the probability that the chip fault average lies between $\ell$ and $\ell + d\ell$.
Averaging (or compounding) Equation 8.10 with respect to this density function
results in

$$
\mathrm{Prob}\{X = k\} = \int_0^{\infty} \frac{e^{-\ell}\ell^{k}}{k!}\, f_L(\ell)\, d\ell
\tag{8.12}
$$

and the chip yield is

$$
Y_{\mathrm{chip}} = \mathrm{Prob}\{X = 0\} = \int_0^{\infty} e^{-\ell} f_L(\ell)\, d\ell
\tag{8.13}
$$

The function $f_L(\ell)$ in this expression is known as the compounder or mixing function.
Any compounder must satisfy the conditions

$$
\int_0^{\infty} f_L(\ell)\, d\ell = 1, \qquad E(L) = \int_0^{\infty} \ell\, f_L(\ell)\, d\ell = \lambda
$$

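The most commonly used mixing function is the Gamma density function with
the two parameters $\alpha$ and $\alpha/\lambda$

$$
f_L(\ell) = \frac{\alpha^{\alpha}}{\lambda^{\alpha}\,\Gamma(\alpha)}\, \ell^{\alpha-1} e^{-\alpha\ell/\lambda}
\tag{8.14}
$$

where $\Gamma(y) = \int_0^{\infty} e^{-u} u^{y-1}\, du$ (see Section 2.2). Evaluating the integral in
Equation 8.12 with respect to Equation 8.14 results in the widely used Negative Binomial
yield formula

$$
\mathrm{Prob}\{X = k\} = \frac{\Gamma(\alpha + k)}{k!\,\Gamma(\alpha)}\,
\frac{(\lambda/\alpha)^{k}}{(1 + \lambda/\alpha)^{\alpha + k}}
\tag{8.15}
$$

and

$$
Y_{\mathrm{chip}} = \mathrm{Prob}\{X = 0\} = \left(1 + \frac{\lambda}{\alpha}\right)^{-\alpha}
\tag{8.16}
$$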

This last model is also called the large-area clustering Negative Binomial model. It
implies that the whole chip constitutes one unit and that subareas within the same
chip are correlated with regard to faults. The Negative Binomial yield model has
two parameters and is therefore flexible and easy to fit to actual data. The parameter
$\lambda$ is the average number of faults per chip, whereas the parameter $\alpha$ is a
measure of the amount of fault clustering. Smaller values of $\alpha$ indicate increased
clustering. Actual values for $\alpha$ typically range between 0.3 and 5. When $\alpha \to \infty$,
Expression 8.16 becomes equal to Equation 8.11, which represents the yield under
the Poisson distribution, characterized by a total absence of clustering. (Note
that the Poisson distribution does not guarantee that the defects will be randomly
spread out: all it says is that there is no inherent clustering. Clusters of defects can
still form by chance in individual instances.)
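The behavior of the Negative Binomial model described above can be explored with a few lines of Python. This is a minimal sketch with assumed function names; it evaluates Equations 8.15 and 8.16 through logarithms for numerical stability and prints the yield for several clustering parameters alongside the Poisson value.

```python
import math

def neg_binomial_pmf(k, lam, alpha):
    """Equation 8.15, evaluated through log-Gamma to avoid overflow."""
    log_p = (math.lgamma(alpha + k) - math.lgamma(k + 1) - math.lgamma(alpha)
             + k * math.log(lam / alpha)
             - (alpha + k) * math.log1p(lam / alpha))
    return math.exp(log_p)

def neg_binomial_yield(lam, alpha):
    """Equation 8.16: (1 + lam/alpha) ** (-alpha)."""
    return math.exp(-alpha * math.log1p(lam / alpha))

lam = 1.0  # assumed average number of faults per chip
for alpha in (0.3, 1.0, 5.0, 1e9):
    # Smaller alpha means heavier clustering and, for fixed lam, higher yield.
    print(alpha, neg_binomial_yield(lam, alpha))
print("Poisson:", math.exp(-lam))
```

For a fixed fault average, the yield decreases toward the Poisson value $e^{-\lambda}$ as $\alpha$ grows, which is consistent with the observation that the Poisson model is pessimistic when faults cluster.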


8.3.2 Variations on the Simple Yield Models 

The large-area clustering compound Poisson model described above makes two 
crucial assumptions: the fault clusters are large compared with the size of the chip, 
and they are of uniform size. In some cases, it is clear from observing the defect 
maps of manufactured wafers that the faults can be divided into two classes— 
heavily clustered and less heavily clustered (see Figure 8.3)—and clearly originate 
from two sources: systematic and random. In these cases, a simple yield model as 
described above will not be able to successfully describe the fault distribution. This 
inadequacy will be more noticeable when attempting to evaluate the yield of chips 
with redundancy. One way to deal with this is to include in the model a gross yield
factor $Y_0$ that denotes the probability that the chip is not hit by a gross defect. Gross
defects are usually the result of systematic processing problems that affect whole
wafers or parts of wafers. They may be caused by misalignment, over- or under-etching,
or out-of-spec semiconductor parameters such as threshold voltage. It has
been shown that even fault clusters with very high fault densities can be modeled
by $Y_0$. If the Negative Binomial yield model is used, then introducing a gross yield
factor $Y_0$ results in

$$
Y_{\mathrm{chip}} = Y_0 \left(1 + \frac{\lambda}{\alpha}\right)^{-\alpha}
\tag{8.17}
$$

FIGURE 8.3 A wafer defect map. I. Koren and Z. Koren, "Defect Tolerant VLSI Circuits:
Techniques and Yield Analysis," Proceedings of the IEEE © 1998 IEEE.


As chips become larger, this approach becomes less practical since very few faults
will hit the entire chip. Instead, two fault distributions, each with a different set
of parameters, may be combined. $X$, the total number of faults on the chip, can
be viewed as $X = X_1 + X_2$, where $X_1$ and $X_2$ are statistically independent random
variables, denoting the number of faults of type 1 and of type 2, respectively, on
the chip. The probability function of $X$ can be derived from

$$
\mathrm{Prob}\{X = k\} = \sum_{j=0}^{k} \mathrm{Prob}\{X_1 = j\} \cdot \mathrm{Prob}\{X_2 = k - j\}
\tag{8.18}
$$

and

$$
Y_{\mathrm{chip}} = \mathrm{Prob}\{X = 0\} = \mathrm{Prob}\{X_1 = 0\} \cdot \mathrm{Prob}\{X_2 = 0\}
\tag{8.19}
$$

If $X_1$ and $X_2$ are modeled by a Negative Binomial distribution with parameters
$\lambda_1, \alpha_1$ and $\lambda_2, \alpha_2$, respectively, then

$$
Y_{\mathrm{chip}} = \left(1 + \frac{\lambda_1}{\alpha_1}\right)^{-\alpha_1}
\left(1 + \frac{\lambda_2}{\alpha_2}\right)^{-\alpha_2}
\tag{8.20}
$$

Another variation on the simple fault distributions may occur in very large chips,
in which the fault clusters appear to be of uniform size but are much smaller than
the chip area. In this case, instead of viewing the chip as one entity for statistical
purposes, it can be viewed as consisting of statistically independent regions
called blocks. The number of faults in each block has a Negative Binomial distribution,
and the faults within the area of the block are uniformly distributed. The
large-area Negative Binomial distribution is a special case in which the whole chip
constitutes one block. Another special case is the small-area Negative Binomial
distribution, which describes very small independent fault clusters. Mathematically,
the medium-area Negative Binomial distribution can be obtained, similarly
to the large-area case, as a Compound Poisson distribution, where the integration
in Equation 8.12 is performed independently over the different regions of the chip.
Let the chip consist of $B$ blocks with an average of $\ell$ faults. Each block will have
an average of $\ell/B$ faults, and according to the Poisson distribution, the chip yield
will be

$$
Y_{\mathrm{chip}} = e^{-\ell} = \left(e^{-\ell/B}\right)^{B}
\tag{8.21}
$$

where $e^{-\ell/B}$ is the yield of one block.

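When each factor in Equation 8.21 is compounded separately with respect to
Equation 8.14, the result is

$$
Y_{\mathrm{chip}} = \left(1 + \frac{\lambda}{B\alpha}\right)^{-B\alpha}
\tag{8.22}
$$

It is also possible that each region on the chip has a different sensitivity to defects,
and thus, block $i$ has the parameters $\lambda_i, \alpha_i$, resulting in

$$
Y_{\mathrm{chip}} = \prod_{i=1}^{B} \left(1 + \frac{\lambda_i}{\alpha_i}\right)^{-\alpha_i}
\tag{8.23}
$$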


It is important to note that the differences among the various models described 
in this section become more noticeable when they are used to project the yield of 
chips with built-in redundancy. 
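The variations discussed in this section differ only in how the basic Negative Binomial factor is combined, which the following Python sketch makes explicit. The function names are assumptions for illustration; the block model is written as the product of B independently compounded block yields, each block carrying an average of λ/B faults.

```python
import math

def nb_yield(lam, alpha):
    """Large-area Negative Binomial yield, (1 + lam/alpha) ** (-alpha)."""
    return math.exp(-alpha * math.log1p(lam / alpha))

def gross_yield_model(y0, lam, alpha):
    """Equation 8.17: gross yield factor Y0 times the Negative Binomial yield."""
    return y0 * nb_yield(lam, alpha)

def two_source_yield(lam1, alpha1, lam2, alpha2):
    """Equation 8.19 with X1 and X2 both Negative Binomial."""
    return nb_yield(lam1, alpha1) * nb_yield(lam2, alpha2)

def medium_area_yield(lam, alpha, blocks):
    """B independent blocks, each with lam/B faults compounded separately."""
    return nb_yield(lam / blocks, alpha) ** blocks

def per_block_yield(block_params):
    """Equation 8.23: block_params is a list of (lam_i, alpha_i) pairs."""
    return math.prod(nb_yield(lam_i, alpha_i) for lam_i, alpha_i in block_params)

lam, alpha = 1.5, 2.0  # assumed chip fault average and clustering parameter
for blocks in (1, 4, 16, 10**6):
    # One block reproduces the large-area model; many small blocks approach Poisson.
    print(blocks, medium_area_yield(lam, alpha, blocks))
print("Poisson:", math.exp(-lam))
```

With a single block the result coincides with the large-area model, while a very large number of blocks drives the yield toward $e^{-\lambda}$, since independent tiny clusters behave like unclustered faults.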

To estimate the parameters of a yield model, the "window method" is regularly
used in the industry. Wafer maps that show the location of functioning and failing
chips are analyzed using overlays with grids or windows. These windows contain
some adjacent chip multiples (e.g., 1, 2, and 4), and the yield for each such multiple
is calculated. Values for the parameters $Y_0$, $\lambda$, and $\alpha$ are then determined by means
of curve fitting.


8.4 Yield Enhancement Through Redundancy 

In this section we describe several techniques to incorporate redundancy in the design
of VLSI circuits to increase the yield. We start by analyzing the yield enhancement
due to redundancy, and then present schemes to introduce redundancy into
memory and logic designs.

8.4.1 Yield Projection for Chips with Redundancy

In many integrated circuit chips, identical blocks of circuits are often replicated.
In memory chips, these are blocks of memory cells, which are also known as subarrays.
In digital chips they are referred to as macros. We will use the term modules
to include both these designations.

In very large chips, if the whole chip is required to be fault-free, the yield will
be very low. The yield can be increased by adding a few spare modules to the design
and accepting those chips that have the required number of fault-free modules.
However, adding redundant modules increases the chip area and reduces the
number of chips that will fit into the wafer area. Consequently, a better measure
for evaluating the benefit of redundancy is the effective yield, defined as

$$
Y_{\mathrm{chip}}^{\mathrm{eff}} = Y_{\mathrm{chip}} \cdot
\frac{\text{Area of chip without redundancy}}{\text{Area of chip with redundancy}}
\tag{8.24}
$$

The maximum value of $Y_{\mathrm{chip}}^{\mathrm{eff}}$ determines the optimal amount of redundancy to be
incorporated into the chip.

The yield of a chip with redundancy is the probability that it has enough fault-free
modules for proper operation. To calculate this probability, a much more detailed
statistical model than described earlier is needed, a model that specifies the
fault distribution for any subarea of the chip, as well as the correlations among the
different subareas of the chip.

Chips with One Type of Modules 

For simplicity, let us first deal with projecting the yield of chips whose only circuitry
is $N$ identical modules, out of which $R$ are spares and at least $M = N - R$
must be fault-free for proper operation. Define the following probability

$$
F_{i,N} = \mathrm{Prob}\{\text{Exactly } i \text{ out of the } N \text{ modules are fault-free}\}.
$$

Then the yield of the chip is given by

$$
Y_{\mathrm{chip}} = \sum_{i=M}^{N} F_{i,N}
\tag{8.25}
$$

Using the spatial Poisson distribution implies that the average number of faults
per module, denoted by $\lambda_m$, is $\lambda_m = \lambda/N$. In addition, when using the Poisson
model, the faults in any distinct subareas are statistically independent, and thus,

$$
F_{i,N} = \binom{N}{i}\left(e^{-\lambda_m}\right)^{i}\left(1 - e^{-\lambda_m}\right)^{N-i}
= \binom{N}{i}\left(e^{-\lambda/N}\right)^{i}\left(1 - e^{-\lambda/N}\right)^{N-i}
\tag{8.26}
$$

and the yield of the chip is

$$
Y_{\mathrm{chip}} = \sum_{i=M}^{N} \binom{N}{i}\left(e^{-\lambda/N}\right)^{i}\left(1 - e^{-\lambda/N}\right)^{N-i}
\tag{8.27}
$$
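Equation 8.27 is a Binomial tail sum and is straightforward to evaluate. The Python sketch below is illustrative only; the function name and the sample values (λ = 1.2 faults over N = 6 modules) are assumptions made for the example.

```python
import math

def poisson_redundancy_yield(lam, n_modules, n_spares):
    """Equation 8.27: at least M = N - R of N modules fault-free, Poisson faults.

    lam is the average number of faults over all N modules (spares included).
    """
    m_required = n_modules - n_spares
    y_mod = math.exp(-lam / n_modules)  # yield of a single module
    return sum(math.comb(n_modules, i) * y_mod**i * (1.0 - y_mod)**(n_modules - i)
               for i in range(m_required, n_modules + 1))

# Assumed example: the same 6-module chip, accepting 0, 1, or 2 faulty modules.
for spares in (0, 1, 2):
    print(spares, poisson_redundancy_yield(1.2, 6, spares))
```

With no spares the sum collapses to $e^{-\lambda}$, the yield of Equation 8.11, which provides a quick sanity check on any implementation.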


Unfortunately, although the Poisson distribution is mathematically convenient,
it does not match actual defect and fault data. If any of the Compound Poisson
distributions is to be used, then the different modules on the chip are not statistically
independent but rather correlated with respect to the number of faults. A simple
formula such as Equation 8.27, which uses the Binomial distribution, is therefore
not appropriate. Several approaches can be followed to calculate the yield in this
case, all leading to the same final expression.

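The first approach applies only to the Compound Poisson models, and is based
on compounding the expression in Equation 8.26 over $\lambda_m$ (as shown in Section 8.3).
Replacing $\lambda/N$ by $\ell$, expanding $(1 - e^{-\ell})^{N-i}$ into the binomial series
$\sum_{k=0}^{N-i} (-1)^{k} \binom{N-i}{k} (e^{-\ell})^{k}$, and substituting into Equation 8.26 results in

$$
F_{i,N} = \binom{N}{i} \sum_{k=0}^{N-i} (-1)^{k} \binom{N-i}{k} e^{-(i+k)\ell}
\tag{8.28}
$$

By compounding Equation 8.28 with a density function $f_L(\ell)$ we obtain

$$
F_{i,N} = \binom{N}{i} \sum_{k=0}^{N-i} (-1)^{k} \binom{N-i}{k} \int_0^{\infty} e^{-(i+k)\ell} f_L(\ell)\, d\ell
$$

Defining $y_n = \int_0^{\infty} e^{-n\ell} f_L(\ell)\, d\ell$ ($y_n$ is the probability that a given subset of $n$ modules
is fault-free, according to the Compound Poisson model) results in

$$
F_{i,N} = \binom{N}{i} \sum_{k=0}^{N-i} (-1)^{k} \binom{N-i}{k}\, y_{i+k}
\tag{8.29}
$$

and the yield of the chip is equal to

$$
Y_{\mathrm{chip}} = \sum_{i=M}^{N} \sum_{k=0}^{N-i} (-1)^{k} \binom{N}{i} \binom{N-i}{k}\, y_{i+k}
\tag{8.30}
$$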


The Poisson model can be obtained as a special case by substituting
$y_{i+k} = e^{-(i+k)\lambda/N}$, whereas for the Negative Binomial model

$$
y_{i+k} = \left(1 + \frac{(i+k)\lambda}{N\alpha}\right)^{-\alpha}
\tag{8.31}
$$

The yield of the chip under this model is

$$
Y_{\mathrm{chip}} = \sum_{i=M}^{N} \sum_{k=0}^{N-i} (-1)^{k} \binom{N}{i} \binom{N-i}{k}
\left(1 + \frac{(i+k)\lambda}{N\alpha}\right)^{-\alpha}
\tag{8.32}
$$
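Equation 8.32 can be combined with the effective yield of Section 8.4.1 to search for a good number of spares. The following Python sketch rests on stated assumptions: the function names are invented for illustration, the chip is taken to consist only of identical modules so that its area is proportional to the module count, each module contributes the same fault average regardless of how many spares are added, and α does not change with area. The numerical values are hypothetical.

```python
import math

def nb_redundancy_yield(lam, alpha, n_modules, n_spares):
    """Equation 8.32: M-of-N yield under the large-area Negative Binomial model.

    lam is the fault average over all N modules. The alternating sum is fine
    for small N; for large N cancellation errors would need more care.
    """
    m_required = n_modules - n_spares
    total = 0.0
    for i in range(m_required, n_modules + 1):
        for k in range(n_modules - i + 1):
            # y_{i+k} from Equation 8.31
            y = math.exp(-alpha * math.log1p((i + k) * lam / (n_modules * alpha)))
            total += ((-1) ** k * math.comb(n_modules, i)
                      * math.comb(n_modules - i, k) * y)
    return total

def effective_yield(lam_module, alpha, m_required, n_spares):
    """Chip yield scaled by the area ratio M / (M + R) (assumed area model)."""
    n_modules = m_required + n_spares
    chip_yield = nb_redundancy_yield(n_modules * lam_module, alpha, n_modules, n_spares)
    return chip_yield * m_required / n_modules

# Hypothetical design: 8 required modules, 0.25 faults per module, alpha = 2.
M_REQUIRED, LAM_MODULE, ALPHA = 8, 0.25, 2.0
best_r = max(range(0, 7), key=lambda r: effective_yield(LAM_MODULE, ALPHA, M_REQUIRED, r))
for r in range(0, 7):
    print(r, round(effective_yield(LAM_MODULE, ALPHA, M_REQUIRED, r), 4))
print("spares maximizing effective yield:", best_r)
```

The printed table shows the trade-off described earlier: the first spares raise the yield faster than they enlarge the chip, after which the area penalty dominates.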


The approach described above to calculating the chip yield applies only to the
Compound Poisson models. A more general approach involves using the Inclusion
and Exclusion formula in order to calculate the probability $F_{i,N}$ and results
in:

$$
F_{i,N} = \binom{N}{i} \sum_{k=0}^{N-i} (-1)^{k} \binom{N-i}{k}\, y_{i+k}
\tag{8.33}
$$

which is the same expression as in Equation 8.29 and therefore leads to Equation 8.30.

Since Equation 8.30 can be obtained from the basic Inclusion and Exclusion formula,
it is quite general and applies to a larger family of distributions than the
Compound Poisson models. The only requirement for it to be applicable is that for
a given $n$, any subset of $n$ modules have the same probability of being fault-free,
and no statistical independence among the modules is required.

As shown above, the yield for any Compound Poisson distribution (including
the pure Poisson) can be obtained from Equation 8.30 by substituting the appropriate
expression for $y_n$. If a gross yield factor $Y_0$ exists, it can be included in $y_n$.
For the model in which the defects arise from two sources and the number of faults
per chip, $X$, can be viewed as $X = X_1 + X_2$,

$$
y_n = y_n^{(1)}\, y_n^{(2)}
$$

where $y_n^{(j)}$ denotes the probability that a given subset of $n$ modules has no type
$j$ faults ($j = 1, 2$). The calculation of $y_n$ for the medium-size clustering Negative
Binomial probability is slightly more complicated and a pointer to it is included in
the Further Reading section.

More Complex Designs 

The simple architecture analyzed in the preceding section is an idealization, because
actual chips rarely consist entirely of identical circuit modules. The more
general case is that of a chip with multiple types of modules, each with its own
redundancy. In addition, all chips include support circuits which are shared by the
replicated modules. The support circuitry almost never has any redundancy and,
if damaged, renders the chip unusable. In what follows, expressions for the yield
of chips with two different types of modules, as well as some support circuits, are
presented. The extension to a larger number of module types is straightforward
but cumbersome and is therefore not included.

Denote by $N_j$ the number of type $j$ modules, out of which $R_j$ are spares. Each
type $j$ module occupies an area of size $a_j$ on the chip ($j = 1, 2$). The area of the support
circuitry is $a_{ck}$ (ck stands for chip-kill, since any fault in the support circuitry
is fatal for the chip). Clearly, $N_1 a_1 + N_2 a_2 + a_{ck} = A_{\mathrm{chip}}$.

Since each circuit type has a different sensitivity to defects, it has a different fault
density. Let $\lambda_{m_1}$, $\lambda_{m_2}$, and $\lambda_{ck}$ denote the average number of faults per type 1 module,
type 2 module, and the support circuitry, respectively. Denoting by $F_{i_1,N_1,i_2,N_2}$
the probability that exactly $i_1$ type 1 modules, exactly $i_2$ type 2 modules, and all
the support circuits are fault-free, the chip yield is given by

$$
Y_{\mathrm{chip}} = \sum_{i_1=M_1}^{N_1} \sum_{i_2=M_2}^{N_2} F_{i_1,N_1,i_2,N_2}
\tag{8.34}
$$

where $M_j = N_j - R_j$ ($j = 1, 2$). According to the Poisson distribution,


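$$
F_{i_1,N_1,i_2,N_2} = \binom{N_1}{i_1}\left(e^{-\lambda_{m_1}}\right)^{i_1}\left(1 - e^{-\lambda_{m_1}}\right)^{N_1-i_1}
\times \binom{N_2}{i_2}\left(e^{-\lambda_{m_2}}\right)^{i_2}\left(1 - e^{-\lambda_{m_2}}\right)^{N_2-i_2} e^{-\lambda_{ck}}
\tag{8.35}
$$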


To get the expression for $F_{i_1,N_1,i_2,N_2}$ under a general fault distribution, we need
to use the two-dimensional Inclusion and Exclusion formula, resulting in

$$
F_{i_1,N_1,i_2,N_2} = \sum_{k_1=0}^{N_1-i_1} \sum_{k_2=0}^{N_2-i_2} (-1)^{k_1+k_2}
\binom{N_1}{i_1}\binom{N_1-i_1}{k_1}\binom{N_2}{i_2}\binom{N_2-i_2}{k_2}\,
y_{i_1+k_1,\,i_2+k_2}
\tag{8.36}
$$

where $y_{n_1,n_2}$ is the probability that a given set of $n_1$ type 1 modules, a given set of
$n_2$ type 2 modules, and the support circuitry are all fault-free. This probability can
be calculated using any of the models described in Section 8.3 with $\lambda$ replaced by
$n_1\lambda_{m_1} + n_2\lambda_{m_2} + \lambda_{ck}$.

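Two noted special cases are the Poisson distribution, for which

$$
y_{n_1,n_2} = \left(e^{-\lambda_{m_1}}\right)^{n_1}\left(e^{-\lambda_{m_2}}\right)^{n_2} e^{-\lambda_{ck}}
= e^{-(n_1\lambda_{m_1} + n_2\lambda_{m_2} + \lambda_{ck})}
\tag{8.37}
$$

and the large-area Negative Binomial distribution, for which

$$
y_{n_1,n_2} = \left(1 + \frac{n_1\lambda_{m_1} + n_2\lambda_{m_2} + \lambda_{ck}}{\alpha}\right)^{-\alpha}
\tag{8.38}
$$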
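The two-type yield of Equations 8.34 and 8.36 can be evaluated for any model that supplies $y_{n_1,n_2}$. The Python sketch below passes that probability in as a function; the function names and the example chip (four type 1 modules with one spare, three type 2 modules with one spare, and the listed fault averages) are assumptions made purely for illustration.

```python
import math

def y_poisson(n1, n2, lam_m1, lam_m2, lam_ck):
    """Equation 8.37."""
    return math.exp(-(n1 * lam_m1 + n2 * lam_m2 + lam_ck))

def y_neg_binomial(n1, n2, lam_m1, lam_m2, lam_ck, alpha):
    """Equation 8.38."""
    return (1.0 + (n1 * lam_m1 + n2 * lam_m2 + lam_ck) / alpha) ** (-alpha)

def two_type_yield(y, n1_total, r1, n2_total, r2):
    """Equations 8.34 and 8.36; y(n1, n2) returns the fault-free probability."""
    total = 0.0
    for i1 in range(n1_total - r1, n1_total + 1):
        for i2 in range(n2_total - r2, n2_total + 1):
            for k1 in range(n1_total - i1 + 1):
                for k2 in range(n2_total - i2 + 1):
                    coef = (math.comb(n1_total, i1) * math.comb(n1_total - i1, k1)
                            * math.comb(n2_total, i2) * math.comb(n2_total - i2, k2))
                    total += (-1) ** (k1 + k2) * coef * y(i1 + k1, i2 + k2)
    return total

# Hypothetical chip: fault averages per type 1 module, type 2 module, support circuitry.
LAM_M1, LAM_M2, LAM_CK, ALPHA = 0.1, 0.2, 0.05, 2.0
print(two_type_yield(lambda a, b: y_poisson(a, b, LAM_M1, LAM_M2, LAM_CK), 4, 1, 3, 1))
print(two_type_yield(lambda a, b: y_neg_binomial(a, b, LAM_M1, LAM_M2, LAM_CK, ALPHA), 4, 1, 3, 1))
```

Under the Poisson model the result factors into two Binomial tail sums times $e^{-\lambda_{ck}}$, in agreement with Equation 8.35, which is a convenient check before switching to a clustered model.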


Some chips have a very complex redundancy scheme that does not conform
to the simple M-of-N redundancy. For such chips, it is extremely difficult to develop
closed-form yield expressions for any model with clustered faults. One possible
solution is to use Monte Carlo simulation, in which faults are thrown at the
wafer according to the underlying statistical model, and the percentage of operational
chips is calculated. A much faster solution is to calculate the yield using the
Poisson distribution, which is relatively easy (although complicated redundancy
schemes may require some non-trivial combinatorial calculations). This yield is
then compounded with respect to $\lambda$ using an appropriate compounder. If the Poisson
yield expression can be expanded into a power series in $\lambda$, analytical integration
is possible. Otherwise, which is more likely, numerical integration has to be
performed. This very powerful compounding procedure was employed to derive
yield expressions for interconnection buses in VLSI chips, for partially good memory
chips, and for hybrid redundancy designs of memory chips.
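The numerical compounding route can be illustrated on a case where the closed form is known, so that the two can be compared. The Python sketch below integrates the Poisson M-of-N yield against the Gamma compounder of Equation 8.14 with a simple Simpson rule and compares the result with Equation 8.32. The function names, the truncation of the integral at 50λ, the restriction to α > 1 (so that the integrand is bounded at zero), and the sample parameters are all assumptions of this sketch.

```python
import math

def poisson_m_of_n_yield(ell, n_modules, m_required):
    """Equation 8.27 with the chip fault average set to ell."""
    y_mod = math.exp(-ell / n_modules)
    return sum(math.comb(n_modules, i) * y_mod**i * (1.0 - y_mod)**(n_modules - i)
               for i in range(m_required, n_modules + 1))

def gamma_compounder(ell, lam, alpha):
    """Equation 8.14; this sketch assumes alpha > 1, so the density is 0 at ell = 0."""
    if ell <= 0.0:
        return 0.0
    log_f = (alpha * math.log(alpha / lam) - math.lgamma(alpha)
             + (alpha - 1.0) * math.log(ell) - alpha * ell / lam)
    return math.exp(log_f)

def compounded_yield(lam, alpha, n_modules, m_required, steps=20000):
    """Simpson-rule integral of the Poisson yield times the compounder."""
    upper = 50.0 * lam  # truncation point; the Gamma tail beyond it is negligible
    h = upper / steps
    total = 0.0
    for j in range(steps + 1):
        ell = j * h
        weight = 1 if j in (0, steps) else (4 if j % 2 else 2)
        total += (weight * poisson_m_of_n_yield(ell, n_modules, m_required)
                  * gamma_compounder(ell, lam, alpha))
    return total * h / 3.0

def closed_form_yield(lam, alpha, n_modules, m_required):
    """Equation 8.32, used here only as a reference value."""
    total = 0.0
    for i in range(m_required, n_modules + 1):
        for k in range(n_modules - i + 1):
            y = (1.0 + (i + k) * lam / (n_modules * alpha)) ** (-alpha)
            total += (-1) ** k * math.comb(n_modules, i) * math.comb(n_modules - i, k) * y
    return total

print(compounded_yield(1.2, 2.0, 6, 4), closed_form_yield(1.2, 2.0, 6, 4))
```

For a redundancy scheme without a closed form, only `poisson_m_of_n_yield` would need to be replaced by the scheme's own Poisson yield function; the integration step stays the same.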

8.4.2 Memory Arrays with Redundancy 

Defect-tolerance techniques have been successfully applied to many designs of
memory arrays due to their high regularity, which greatly simplifies the task of
incorporating redundancy into their design. A variety of defect-tolerance techniques
have been exploited in memory designs, from the simple technique using
spare rows and columns (also known as word lines and bit lines, respectively)
through the use of error-correcting codes. These techniques have been successfully
employed by many semiconductor manufacturers, resulting in significant yield
improvements ranging from 30-fold increases in the yield of early prototypes to
1.5-fold or even 3-fold yield increases in mature processes.

The most common implementations of defect-tolerant memory arrays include 
redundant bit lines and word lines, as shown in Figure 8.4. The figure shows a 
memory array that was split into two subarrays (to avoid very long word and bit 
lines which may slow down the memory read and write operations) with spare 
rows and columns. A defective row, for example, or a row containing one or more 
defective memory cells can be disconnected by blowing a fusible link at the output 
of the corresponding decoder as shown in Figure 8.5. The disconnected row is then 
replaced by a spare row which has a programmable decoder with fusible links, 
allowing it to replace any defective row (see Figure 8.5). 



FIGURE 8.4 A memory array with spare rows and columns.

FIGURE 8.5 Standard and programmable decoders.

The first designs that included spare rows and columns relied on laser fuses that
impose a relatively large area overhead and require the use of special laser equipment
to disconnect faulty lines and connect spare lines in their place. In recent
years, laser fuses have been replaced by CMOS fuses, which can be programmed
internally with no need for external laser equipment. Since any defect that may
occur in the internal programming circuit will constitute a chip-kill defect, several
memory designers have incorporated error-correcting codes into these programming
circuits to increase their reliability.

To determine which rows and columns should be disconnected and replaced 
by spare rows and columns, respectively, we first need to identify all the faulty 
memory cells. The memory must be thoroughly tested, and for each faulty cell, 
a decision has to be made as to whether the entire row or column should be
disconnected. In recent memory chip designs, the identification of faulty cells is
done internally using Built-In Self-Testing (BIST), thus avoiding the need for
external testing equipment. In more advanced designs, the reconfiguration of the
memory array based on the results of the testing is also performed internally.
Implementing self-testing of the memory is quite straightforward and involves
sequentially scanning all memory locations and writing and reading 0s and 1s
into all the bits. The next step of determining how to assign spare rows and
columns to replace all defective rows and columns is considerably more
complicated because each individual defective cell can be taken care of by
replacing either the cell's row or the cell's column. An arbitrary assignment of
spare rows and columns may lead to a situation where the available spares are
insufficient, while a different assignment may allow the complete repair of the
memory array.
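
The self-test scan described above can be sketched in a few lines of Python. This
is only a minimal illustration under a simple stuck-at fault model, not a
production BIST algorithm; the class StuckAtMemory and the function
scan_for_faults are hypothetical names introduced here for the example.

```python
class StuckAtMemory:
    """Toy rows x cols bit array in which some cells are stuck at a fixed value."""

    def __init__(self, rows, cols, stuck):
        self.rows, self.cols = rows, cols
        self.stuck = dict(stuck)                      # {(row, col): stuck value}
        self.bits = [[0] * cols for _ in range(rows)]

    def write(self, r, c, value):
        # A stuck cell ignores the written value.
        self.bits[r][c] = self.stuck.get((r, c), value)

    def read(self, r, c):
        return self.stuck.get((r, c), self.bits[r][c])


def scan_for_faults(mem):
    """Write and read back 0s, then 1s, in every cell; return the faulty cells."""
    faulty = set()
    for value in (0, 1):
        for r in range(mem.rows):
            for c in range(mem.cols):
                mem.write(r, c, value)
        for r in range(mem.rows):
            for c in range(mem.cols):
                if mem.read(r, c) != value:
                    faulty.add((r, c))
    return faulty


memory = StuckAtMemory(6, 6, {(0, 2): 1, (5, 3): 0})
print(sorted(scan_for_faults(memory)))
```

The set of faulty cells returned by such a scan is the input to the spare
assignment step discussed next.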

To illustrate the complexity of this assignment problem, consider the 6 × 6 memory
array with two spare rows (SR0 and SR1) and two spare columns (SC0 and SC1),
shown in Figure 8.6. The array has 7 of its 36 cells defective, and we need
to decide which rows and columns to disconnect and replace by spares to obtain
a fully operational 6 × 6 array. Suppose we use a simple Row First assignment
algorithm that calls for using all the available spare rows first and then the spare
columns. For the array in Figure 8.6, we will first replace rows R0 and R1 by the two
spare rows and be left with four defective cells. Because only two spare columns
exist, the memory array is not repaired. As we will see below, a different
assignment can repair the array using the available spare rows and columns.

FIGURE 8.6 A 6 × 6 memory array with two spare rows, two spare columns, and seven
defective cells (marked by x).

FIGURE 8.7 The bipartite graph corresponding to the memory array in Figure 8.6.
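
The Row First algorithm can be sketched as follows. Since Figure 8.6 is not
reproduced here, the defect map DEFECTS below is a hypothetical one, chosen only
to be consistent with the description in the text (seven defective cells, three
of them in column C2); the function name row_first is likewise an assumption.

```python
# Hypothetical defect map (row, column), consistent with the text but not
# necessarily identical to Figure 8.6.
DEFECTS = [(0, 2), (0, 4), (1, 2), (3, 0), (4, 2), (5, 3), (5, 5)]


def row_first(defects, spare_rows, spare_cols):
    """Use up the spare rows first, then the spare columns.

    Returns (replaced rows, replaced columns, defects left unrepaired).
    """
    remaining = set(defects)
    rows_used, cols_used = [], []
    for r in sorted({r for r, _ in remaining}):
        if len(rows_used) == spare_rows:
            break
        rows_used.append(r)
        remaining = {(i, j) for i, j in remaining if i != r}
    for c in sorted({c for _, c in remaining}):
        if len(cols_used) == spare_cols:
            break
        cols_used.append(c)
        remaining = {(i, j) for i, j in remaining if j != c}
    return rows_used, cols_used, remaining


rows, cols, left = row_first(DEFECTS, spare_rows=2, spare_cols=2)
print(rows, cols, sorted(left))
```

With this map, rows R0 and R1 are replaced first, four defective cells in four
different columns remain, and the two spare columns cannot cover them all.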

To devise a better algorithm for determining which rows and columns should
be switched out and replaced by spares, we can use the bipartite graph shown
in Figure 8.7. This graph contains two sets of vertices corresponding to the rows
(R0 through R5) and columns (C0 through C5) of the memory array and has an
edge connecting Ri to Cj if the cell at the intersection of row Ri and column Cj is
defective. Thus, to determine the smallest number of rows and columns that must
be disconnected (and replaced by spares), we need to select the smallest number
of vertices in Figure 8.7 required to cover all the edges (for each edge at least one
of the two incident nodes must be selected). For the simple example in Figure 8.7,
it is easy to see that we should select C2 and R5 to be replaced by a spare column
and row, respectively, and then select one out of C0 and R3 and, similarly, one out
of C4 and R0.
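
For an array this small, a feasible cover can simply be found by exhaustive
search. The sketch below tries every set of at most spare_rows defective rows
and checks whether the defects left uncovered fit into the available spare
columns. The defect map is a hypothetical one consistent with the text (Figure
8.6 is not reproduced here), and find_repair is a name introduced for
illustration; the running time grows exponentially with the number of spare rows.

```python
from itertools import combinations

# Hypothetical defect map (row, column), consistent with the text but not
# necessarily identical to Figure 8.6.
DEFECTS = [(0, 2), (0, 4), (1, 2), (3, 0), (4, 2), (5, 3), (5, 5)]


def find_repair(defects, spare_rows, spare_cols):
    """Return (rows, columns) covering all defects within the spares, or None."""
    defective_rows = sorted({r for r, _ in defects})
    for k in range(spare_rows + 1):
        for rows in combinations(defective_rows, k):
            # Every defect not covered by a chosen row needs its column replaced.
            cols = {c for r, c in defects if r not in rows}
            if len(cols) <= spare_cols:
                return set(rows), cols
    return None


print(find_repair(DEFECTS, spare_rows=2, spare_cols=2))
```

For the hypothetical map, the search selects rows R0 and R5 together with
columns C0 and C2, which is one of the covers named in the text.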

This problem is known as bipartite graph edge covering and, when the numbers of
rows and columns selected are limited separately by the available spares, has been
shown to be NP-complete. Therefore, no algorithm of polynomial complexity is
currently known for the spare rows and columns assignment problem. We could restrict
our designs to have, for example, spare rows only, which would considerably
reduce the complexity of this problem. If only spare rows are available, we must
replace every row with one or more defective cells by a spare row if one exists. 
This, however, is not a practical solution for two reasons. First, if two (or more) 
defects happen in a single column, we will need to use two (or more) spare rows 
instead of a single spare column (see, for example, column C2 in Figure 8.6), which
would significantly increase the required number of spare rows. Second, a
reasonably common defect in memory arrays is a completely defective column (or row),
which would be uncorrectable if no spare columns (or rows) are provided. 

As a result, many heuristics for the assignment of spare rows and columns have 
been developed and implemented. These heuristics rely on the fact that it is not 
necessary to find the minimum number of rows and columns that should be
replaced by spares, but only to find a feasible solution for repairing the array with
the given number of spares. 

A simple assignment algorithm consists of two steps. The first identifies which
rows (and columns) must be selected for replacement. A must-repair row is a row
that contains a number of defective cells that is greater than the number of
currently available spare columns. Must-repair columns are defined similarly. For
example, column C2 in Figure 8.6 is a must-repair column because it contains three
defective cells, whereas only two spare rows are available. Once such must-repair
rows and columns are replaced by spares, the number of available spares is
reduced and other rows and columns may become must-repair. For example, after
identifying C2 as a must-repair column and replacing it by, say, SC0, we are left
with a single spare column, making row R5 a must-repair row. This process is
continued until no new must-repair rows and columns can be identified, yielding an
array with sparse defects.
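
The must-repair step can be sketched as a loop that recounts the defects per row
and per column after every replacement. As before, the defect map is a
hypothetical one consistent with the text rather than a copy of Figure 8.6, and
the function name must_repair is an assumption made for this example.

```python
from collections import Counter

# Hypothetical defect map (row, column), consistent with the text but not
# necessarily identical to Figure 8.6.
DEFECTS = [(0, 2), (0, 4), (1, 2), (3, 0), (4, 2), (5, 3), (5, 5)]


def must_repair(defects, spare_rows, spare_cols):
    """Replace must-repair rows and columns until none are left.

    Returns (rows, columns, remaining defects), or None if a must-repair
    line is found but no spare of the required kind is available.
    """
    remaining = set(defects)
    rows, cols = [], []
    while True:
        rows_left = spare_rows - len(rows)
        cols_left = spare_cols - len(cols)
        row_counts = Counter(r for r, _ in remaining)
        col_counts = Counter(c for _, c in remaining)
        # A row is must-repair if it has more defects than spare columns left.
        bad_row = next((r for r in sorted(row_counts)
                        if row_counts[r] > cols_left), None)
        # A column is must-repair if it has more defects than spare rows left.
        bad_col = next((c for c in sorted(col_counts)
                        if col_counts[c] > rows_left), None)
        if bad_row is not None:
            if rows_left == 0:
                return None
            rows.append(bad_row)
            remaining = {(r, c) for r, c in remaining if r != bad_row}
        elif bad_col is not None:
            if cols_left == 0:
                return None
            cols.append(bad_col)
            remaining = {(r, c) for r, c in remaining if c != bad_col}
        else:
            return rows, cols, remaining


print(must_repair(DEFECTS, spare_rows=2, spare_cols=2))
```

For the hypothetical map, column C2 is replaced first and row R5 second, after
which two isolated defective cells remain for the second step of the algorithm.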

Although the first step of identifying must-repair rows and columns is
reasonably simple, the second step is complicated. Fortunately, to achieve high
performance, the size of memory arrays that have their own spare rows and columns
is kept reasonably small (about 1 Mbit or less) and as a result, only a few defects
remain to be taken care of in the second step of the algorithm. Consequently, even
a very simple heuristic such as the above-mentioned row-first will work properly
in most cases. In the example in Figure 8.6, after replacing the must-repair column
C2 and the must-repair row R5, we will replace R0 by the remaining spare row and
then replace C0 by the remaining spare column. A simple modification to the
row-first algorithm that can improve its success rate is to first replace rows and
columns with multiple defective cells and only then address the rows and columns
which have a single defective cell.
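
One possible reading of this modified heuristic is sketched below: lines with two
or more defective cells are replaced first (rows before columns), and lines with
a single defective cell are handled afterwards. The tie-breaking order, the
function name multi_defect_first, and the defect map (hypothetical, consistent
with the text but not copied from Figure 8.6) are all assumptions of this sketch.

```python
from collections import Counter

# Hypothetical defect map (row, column), consistent with the text but not
# necessarily identical to Figure 8.6.
DEFECTS = [(0, 2), (0, 4), (1, 2), (3, 0), (4, 2), (5, 3), (5, 5)]


def multi_defect_first(defects, spare_rows, spare_cols):
    """Replace lines with multiple defects first, then single-defect lines."""
    remaining = set(defects)
    rows, cols = [], []

    def replace_lines(min_defects):
        nonlocal remaining
        progress = True
        while progress and remaining:
            progress = False
            row_counts = Counter(r for r, _ in remaining)
            col_counts = Counter(c for _, c in remaining)
            row = next((r for r in sorted(row_counts)
                        if row_counts[r] >= min_defects), None)
            col = next((c for c in sorted(col_counts)
                        if col_counts[c] >= min_defects), None)
            if row is not None and len(rows) < spare_rows:
                rows.append(row)
                remaining = {(r, c) for r, c in remaining if r != row}
                progress = True
            elif col is not None and len(cols) < spare_cols:
                cols.append(col)
                remaining = {(r, c) for r, c in remaining if c != col}
                progress = True

    replace_lines(2)   # rows and columns with multiple defective cells
    replace_lines(1)   # rows and columns with a single defective cell
    return rows, cols, remaining


print(multi_defect_first(DEFECTS, spare_rows=2, spare_cols=2))
```

On the hypothetical map this variant repairs the array completely, whereas the
plain row-first order leaves defective cells uncovered.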

Even the yield of memory chips that use redundant rows and columns cannot 
be expected to reach 100%, especially during the early phases of manufacturing 
when the defect density is still high. Consequently, several manufacturers package 
and sell partially good chips instead of discarding them. Partially good chips are 
chips that have some but not all of their cell arrays operational, even after using 
all the redundant lines.

The embedding of large memory arrays in VLSI chips is becoming very common,
the best-known example being the large cache units in microprocessors.
These large embedded memory arrays are designed with more aggressive design 
rules compared with the remaining logic units and, consequently, tend to be more 
prone to defects. As a result, most manufacturers of microprocessors include some 
form of redundancy in the cache designs, especially in the second level cache units, 
which normally have a larger size than the first level of caches. The incorporated 
redundancy can be in the form of spare rows, spare columns or spare subarrays. 

Advanced Redundancy Techniques 

The conventional redundancy technique (using spare rows and columns) can be 
enhanced, for example, by using an error-correcting code (ECC). Such an approach 
has been applied in the design of a 16-Mb DRAM chip. This chip includes four
independent subarrays with 16 redundant bit lines and 24 redundant word lines
per subarray. In addition, for every 128 data bits, nine check bits were added to 
allow the correction of any single-bit error within these 137 bits (this is a (137,128)
SEC/DED Hamming code; see Section 3.1). To reduce the probability of two or
more faulty bits in the same word (e.g., due to clustered faults), every eight
adjacent bits in the subarray were assigned to eight separate words. It was found that
the benefit of the combined strategy for yield enhancement was greater than the
sum of the expected benefits of the two individual techniques. The reason is that
the ECC technique is very effective against individual cell failures, whereas
redundant rows and columns are very effective against several defective cells within
the same row or column, as well as against completely defective rows and columns.
As mentioned in Chapter 3, the ECC technique is commonly used in large memory
systems to protect against intermittent faults occurring while the memory is
in operation, in order to increase its reliability. The reliability improvement due to 
the use of ECC was shown to be only slightly affected by the use of the check bits 
to correct defective memory cells. 
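
The figure of nine check bits for 128 data bits follows from the standard Hamming
bound, as the short sketch below shows. It also illustrates one plausible
interleaving rule in which the physical bit position modulo 8 selects the word;
the actual mapping used in the chip is not given in the text, so word_of_bit is
only an assumed illustration, as are both function names.

```python
def sec_ded_check_bits(data_bits):
    """Check bits of a Hamming SEC code plus one overall parity bit for DED."""
    r = 0
    # A single-error-correcting code needs 2**r >= data_bits + r + 1.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


def word_of_bit(position, ways=8):
    """Assumed interleaving: adjacent physical bits go to different words."""
    return position % ways


print(sec_ded_check_bits(128))               # 9
print([word_of_bit(p) for p in range(16)])   # [0, 1, ..., 7, 0, 1, ..., 7]
```

Under this assumed mapping, a cluster of up to eight adjacent faulty bits places
at most one error in any word, which a single-error-correcting code can handle.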

Increases in the size of memory chips in the last several years made it
necessary to partition the memory array into several subarrays in order to decrease the
current and reduce the access time by shortening the length of the bit and word 
lines. Using the conventional redundancy method implied that each subarray has 
its own spare rows and columns, leading to situations in which one subarray had 
an insufficient number of spare lines to handle local faults and other subarrays 
still had some unused spares. One obvious approach to resolve this problem is to 
turn some of the local redundant lines into global redundant lines, allowing for a 
more efficient use of the spares at the cost of higher silicon area overhead due to 
the larger number of required programmable fuses. 

Several other approaches for more efficient redundancy schemes have been
developed. One such approach was followed in the design of a 1-Gb DRAM. This
design used fewer redundant lines than the traditional technique, and the redundant
lines were kept local. For added defect-tolerance, each subarray of size 256 Mb
(a quarter of the chip) was fabricated in such a way that it could become part of
up to four different memory ICs. The resulting wafer shown in Figure 8.8 includes
112 such subarrays out of which 16 (marked by a circle in the figure) would not be
fabricated in an ordinary design in which the chip boundaries are fixed.

FIGURE 8.8 An 8" wafer containing 112 256-Mbit subarrays. (The 16 subarrays marked
with a circle would not be fabricated in an ordinary design.)

To allow this flexibility in determining the chip boundaries, the area of the
subarray had to be increased by 2%, but in order to keep the overall area of the
subarray identical to that in the conventional design, row redundancy was eliminated,
thus compensating for this increase. Column redundancy was still implemented.

Yield analysis of the design in Figure 8.8 shows that if the faults are almost
evenly distributed and the Poisson distribution can be used, there is almost no
advantage in using the new design compared to the conventional design with fixed
chip boundaries and use of the conventional row and column redundancy
technique. There is, however, a considerable increase in yield if the medium-area
Negative Binomial distribution (described in Section 8.3.2) applies. The extent of
the improvement in yield is very sensitive to the fabrication parameter values.

Another approach for incorporating defect-tolerance into memory ICs
combines row and column redundancy with several redundant subarrays that are to
replace those subarrays hit by chip-kill faults. Such an approach was followed by
the designers of another 1-Gbit memory which includes eight mats of size 128 Mbit
each and eight redundant blocks of size 1 Mbit each (see Figure 8.9). The
redundant block consists of four basic 256-Kbit arrays and has an additional eight spare
rows and four spare columns (see Figure 8.10), the purpose of which is to increase 
the probability that the redundant block itself is operational and can be used for 
replacing a block with chip-kill faults. 

Every mat consists of 512 basic arrays of size 256 Kbit each and has 32 spare
rows and 32 spare columns. However, these are not global spares. Four spare rows
are allocated to a 16-Mbit portion of the mat and eight spare columns are allocated
to a 32-Mbit portion of the mat.

FIGURE 8.9 A 1-Gb chip with eight mats of size 128 Mbit each and eight redundant
blocks (RB) of size 1 Mbit each. I. Koren and Z. Koren, "Defect Tolerant VLSI Circuits: Techniques
and Yield Analysis," Proceedings of the IEEE © 1998 IEEE.




FIGURE 8.10 A redundant block including four 256-Kbit arrays, eight redundant rows, 
and four redundant columns. I. Koren and Z. Koren, "Defect Tolerant VLSI Circuits: Techniques 
and Yield Analysis," Proceedings of the IEEE © 1998 IEEE. 



FIGURE 8.11 Yield as a function of λ for different numbers of redundant blocks per
half chip (chip-kill probability = 5 × 10^-4). I. Koren and Z. Koren, "Defect Tolerant VLSI Circuits:
Techniques and Yield Analysis," Proceedings of the IEEE © 1998 IEEE.


The yield of this new design of a memory chip is compared to that of the
traditional design with only row and column redundancy in Figure 8.11, demonstrating
the benefits of some amount of block redundancy. The increase in yield is much
greater than the 2% area increase required for the redundant blocks. It can also be
shown that column redundancy is still beneficial even when redundant blocks are
incorporated and that the optimal number of such redundant columns is
independent of the number of spare blocks.

8.4.3 Logic Integrated Circuits with Redundancy 

In contrast to memory arrays, very few logic ICs have been designed with any
built-in redundancy. Some regularity in the design is necessary if a low overhead
for redundancy inclusion is desired. For completely irregular designs, duplication
and even triplication are currently the only available redundancy techniques, and
these are often impractical due to their large overhead. Regular circuits such as
Programmable Logic Arrays (PLAs) and arrays of identical computing elements
require less redundancy, and various defect-tolerance techniques have been
proposed (and some implemented) in order to enhance their yield. These techniques,
however, require extra circuits such as spare product terms (for PLAs),
reconfiguration switches, and additional input lines to allow the identification of
faulty product terms. Unlike memory ICs in which all defective cells can be identified
by applying external test patterns, the identification of defective elements in logic 
ICs (even for those with regular structure) is more complex and usually requires 
the addition of some built-in testing aids. Thus, testability must also be a factor in 
choosing defect-tolerant designs for logic ICs. 

The situation becomes even more complex in random logic circuits such as
microprocessors. When designing such circuits, it is necessary to partition the
design into separate components, preferably with each having a regular structure.
Then, different redundancy schemes can be applied to the different components, 
including the possibility of no defect-tolerance in components for which the cost 
of incorporating redundancy becomes prohibitive. 

We describe next two examples of such designs: a defect-tolerant
microprocessor and a wafer-scale design. These demonstrate the feasibility of incorporating
defect tolerance for yield enhancement in the design of processors and prove that 
the use of defect tolerance is not limited to the highly regular memory arrays. 

The Hyeti microprocessor is a 16-bit defect-tolerant microprocessor that was
designed and fabricated to demonstrate the feasibility of a high-yield defect-tolerant
microprocessor. This microprocessor may be used as the core of an
application-specific microprocessor-based system that is integrated on a single chip. The large
silicon area consumed by such a system would most certainly result in low yield, 
unless some defect tolerance in the form of redundancy were incorporated into 
the design. 

The data path of the microprocessor contains several functional units such as
registers, an arithmetic and logic unit (ALU), and bus circuitry. Almost all the units
in the data path have circuits that are replicated 16 times, leading to the classic
bit-slice organization. This regular organization was exploited for yield enhancement
by providing a spare slice that can replace a defective slice. Not all the circuits in
the data path, though, consist of completely identical subcircuits. The status
register, for example, has each bit associated with unique random logic and therefore
has no added redundancy.

FIGURE 8.12 The effective yield as a function of the added area, without redundancy
and with optimal redundancy, for the Negative Binomial distribution with λ = 0.05/mm^2
and α = 2. I. Koren and Z. Koren, "Defect Tolerant VLSI Circuits: Techniques and Yield Analysis,"
Proceedings of the IEEE © 1998 IEEE.

The control part has been designed as a hardwired control circuit that can be 
implemented using PLAs only. The regular structure of a PLA allows a
straightforward incorporation of redundancy for yield enhancement through the addition
of spare product terms. The design of the PLA has been modified to allow the 
identification of defective product terms. 

Yield analysis of this microprocessor has shown that the optimal redundancy 
for the data path is a single 1-bit slice and the optimal redundancy for all the PLAs 
is one product term. A higher-than-optimal redundancy has, however, been
implemented in many of these PLAs, because the floorplan of the control unit allows
for the addition of a few extra product terms to the PLAs with no area penalty.
A practical yield analysis should take into consideration the exact floorplan of the
chip and allow the addition of a limited amount of redundancy beyond the
optimal amount. Still, not all the available area should be used up for spares, since this
will increase the switching area, which will, in turn, increase the chip-kill area. 
This greater chip-kill area can, at some point, offset the yield increase resulting 
from the added redundancy. 

Figure 8.12 depicts the effective yield (see Equation 8.24) without redundancy 
in the microprocessor and with the optimal redundancy as a function of the area 
of the circuitry added to the microprocessor (which serves as a controller for that 
circuitry). The figure shows that an increase in yield of about 18% can be expected 
when the optimal amount of redundancy is incorporated in the design. 

A second experiment with defect-tolerance in nonmemory designs is the 3-D
Computer, an example of a wafer-scale design. The 3-D Computer is a cellular
array processor implemented in wafer scale integration technology. The most
distinctive feature of its implementation is its use of stacked wafers. The basic
processing element is divided into five functional units, each of which is implemented
on a different wafer. Thus, each wafer contains only one type of functional unit and
includes spares for yield enhancement as explained below. Units in different wafers
are connected vertically through microbridges between adjacent wafers to form a
complete processing element. The first working prototype of the 3-D Computer
was of size 32 × 32. The second prototype included 128 × 128 processing elements.

Defect tolerance in each wafer is achieved through an interstitial redundancy 
scheme (see Section 4.2.3) in which the spare units are uniformly distributed in the 
array and are connected to the primary units with local and short interconnects. In 
the 32 × 32 prototype, a (1,1) redundancy scheme was used, and each primary unit
had a separate spare unit. A (2,4) scheme was used in the 128 × 128 prototype; each
primary unit is connected to two spare units, and each spare unit is connected to 
four primary units, resulting in a redundancy of 50% rather than the 100% for the 
(1,1) scheme. The (2,4) interstitial redundancy scheme can be implemented in a 
variety of ways. The exact implementation in the 3-D Computer and its effect on 
the yield are further discussed in the next section. 

Since it is highly unlikely that a fabricated wafer will be entirely fault-free, the 
yield of the processor would be zero if no redundancy were included. With the 
implemented redundancy, the observed yield of the 32 × 32 array after repair was
45%. For the 128 × 128 array, the (1,1) redundancy scheme would have resulted
in a very low yield (about 3%), due to the high probability of having faults in a
primary unit and in its associated spare. The yield of the 128 × 128 array with the
(2,4) scheme was projected to be much higher. 

8.4.4 Modifying the Floorplan 

The floorplan of a chip is normally not expected to have an impact on its yield. This 
is true for chips that are small and have a fault distribution that can be accurately 
described by either the Poisson or the Compound Poisson yield models with large- 
area clustering (in which the size of the fault clusters is larger than the size of the 
chip). 

The situation has changed with the introduction of integrated circuits with a 
total area of 2 cm^2 and up. Such chips usually consist of different component types,
each with its own fault density, and have some incorporated redundancy. If chips 
with these attributes are hit by medium-sized fault clusters, then changes in the 
floorplan can affect their projected yield. 

Consider the following example, depicted in Figure 8.13, of a chip consisting 
of four equal-area modules (functional units), M1, M2, M3, and M4. The chip has
no incorporated redundancy, and all four modules are necessary for the proper 
operation of the chip. 

Assuming that the defect clusters are medium-sized relative to the chip size and
that the four modules have different sensitivities to defects, we use the
medium-area Negative Binomial distribution (described in Section 8.3.2) for the spatial
distribution of faults, with parameters λi (for module Mi) and α (per block), and
λ1 ≤ λ2 ≤ λ3 ≤ λ4.

FIGURE 8.13 Three floorplans of a 2 × 2 array. I. Koren and Z. Koren, "Defect Tolerant VLSI
Circuits: Techniques and Yield Analysis," Proceedings of the IEEE © 1998 IEEE.

This chip has 4! = 24 possible floorplans. Since rotation and reflection will not 
affect the yield, we are left with three distinct floorplans, shown in Figure 8.13. If 
small-area clustering (clusters smaller than or comparable to the size of a module) 
or large-area clustering (clusters larger than or equal to the chip area) is assumed, 
the projected yields of all possible floorplans will be the same. This is not the case, 
however, when medium-area clustering (with horizontal or vertical blocks of two 
modules) is assumed. 

Assuming horizontal defect blocks of size two modules, the yields of floorplans
(a), (b), and (c) are

$$Y(a) = Y(b) = (1 + (\lambda_1 + \lambda_2)/\alpha)^{-\alpha} (1 + (\lambda_3 + \lambda_4)/\alpha)^{-\alpha}$$

$$Y(c) = (1 + (\lambda_1 + \lambda_4)/\alpha)^{-\alpha} (1 + (\lambda_2 + \lambda_3)/\alpha)^{-\alpha} \qquad (8.39)$$

A simple calculation shows that under the condition λ1 ≤ λ2 ≤ λ3 ≤ λ4, floorplans
(a) and (b) have the higher yield. Similarly, for vertical defect blocks of size two
modules,

$$Y(a) = Y(c) = (1 + (\lambda_1 + \lambda_3)/\alpha)^{-\alpha} (1 + (\lambda_2 + \lambda_4)/\alpha)^{-\alpha}$$

$$Y(b) = (1 + (\lambda_1 + \lambda_4)/\alpha)^{-\alpha} (1 + (\lambda_2 + \lambda_3)/\alpha)^{-\alpha} \qquad (8.40)$$

and floorplans (a) and (c) have the higher yield. Thus, floorplan (a) is the one 
which maximizes the chip yield for any cluster size. An intuitive explanation for 
the choice of (a) is that the less sensitive modules are placed together, increasing 
the chance that the chip will survive a cluster of defects. 
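
The comparison can be checked numerically with a short sketch. The module
placements in FLOORPLANS are inferred from the way the modules are paired in
Equations 8.39 and 8.40 (Figure 8.13 is not reproduced here), and the parameter
values and function names are arbitrary choices made for illustration.

```python
# Module placements inferred from the pairings in Equations 8.39 and 8.40.
FLOORPLANS = {
    "a": [[1, 2], [3, 4]],
    "b": [[1, 2], [4, 3]],
    "c": [[1, 4], [3, 2]],
}


def block_yield(lams, alpha):
    """Probability that a block with the given module fault averages is fault-free."""
    return (1 + sum(lams) / alpha) ** (-alpha)


def floorplan_yield(grid, lam, alpha, orientation):
    """Chip yield for horizontal (row) or vertical (column) two-module blocks."""
    blocks = grid if orientation == "horizontal" else list(zip(*grid))
    y = 1.0
    for block in blocks:
        y *= block_yield([lam[m] for m in block], alpha)
    return y


lam = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}   # example values, lambda_1 <= ... <= lambda_4
alpha = 2.0
for orientation in ("horizontal", "vertical"):
    yields = {name: floorplan_yield(grid, lam, alpha, orientation)
              for name, grid in FLOORPLANS.items()}
    print(orientation, {name: round(y, 4) for name, y in yields.items()})
```

For these example values, floorplan (a) attains the highest yield for both block
orientations, in agreement with the conclusion above.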

If the previous chip is generalized to a 3 × 3 array (as depicted in Figure 8.14),
and λ1 ≤ λ2 ≤ ··· ≤ λ9, then, unfortunately, there is no one floorplan which is
always the best and the optimal floorplan depends on the cluster size. However, the
following generalizations can be made.

For all cluster sizes, the module with the highest fault density (M9) should be
placed in the center of the chip, and each row or column should be rearranged so
that its most sensitive module is in its center (such as, for example, floorplan (b)
in Figure 8.14). Note that we reached this conclusion without assuming that the
boundaries of the chip are more prone to defects than its center. The intuitive
explanation for this recommendation is that placing highly sensitive modules at the
chip corners increases the probability that a single fault cluster will hit two or even
four adjacent chips on the wafer. This is less likely to happen if the less sensitive
modules are placed at the corners.

FIGURE 8.14 Two floorplans of a 3 × 3 array. I. Koren and Z. Koren, "Defect Tolerant VLSI
Circuits: Techniques and Yield Analysis," Proceedings of the IEEE © 1998 IEEE.

FIGURE 8.15 Three alternative floorplans for a chip with redundancy. I. Koren and Z. Koren,
"Defect Tolerant VLSI Circuits: Techniques and Yield Analysis," Proceedings of the IEEE © 1998
IEEE.

The next example is that of a chip with redundancy. The chip consists of four
modules, M1, S1, M2, and S2, where S1 is a spare for M1 and S2 is a spare for M2.
The three topologically distinct floorplans for this chip are shown in Figure 8.15.
Let the number of faults have a medium-area Negative Binomial distribution with
an average of λ1 for M1 and S1, and λ2 for M2 and S2, and a clustering parameter
of α per block. Assuming that the defect clusters are horizontal and of size two
modules each, the yields of the three floorplans are

$$
\begin{aligned}
Y(a) = Y(c) ={}& 2[1 + (\lambda_1 + \lambda_2)/\alpha]^{-\alpha} + 2[1 + \lambda_1/\alpha]^{-\alpha}[1 + \lambda_2/\alpha]^{-\alpha} \\
& - 2[1 + (\lambda_1 + \lambda_2)/\alpha]^{-\alpha}[1 + \lambda_1/\alpha]^{-\alpha} \\
& - 2[1 + (\lambda_1 + \lambda_2)/\alpha]^{-\alpha}[1 + \lambda_2/\alpha]^{-\alpha} + [1 + (\lambda_1 + \lambda_2)/\alpha]^{-2\alpha} \qquad (8.41)
\end{aligned}
$$

$$Y(b) = [2(1 + \lambda_1/\alpha)^{-\alpha} - (1 + 2\lambda_1/\alpha)^{-\alpha}] \times [2(1 + \lambda_2/\alpha)^{-\alpha} - (1 + 2\lambda_2/\alpha)^{-\alpha}] \qquad (8.42)$$

FIGURE 8.16 The original and alternative floorplans of a wafer in the 3-D Computer.
I. Koren and Z. Koren, "Defect Tolerant VLSI Circuits: Techniques and Yield Analysis," Proceedings of
the IEEE © 1998 IEEE.

It can be easily proved that for any values of λ1 and λ2, Y(a) = Y(c) ≥ Y(b).

If, however, the defect clusters are vertical and of size two modules, then clearly, 
Y(a) is given by Equation 8.42 and Y(b) = Y(c) and are given by Equation 8.41. 
In this case, Y(b) = Y(c) ≥ Y(a) for all values of λ1 and λ2. Floorplan (c) should
therefore be preferred over floorplans (a) and (b). An intuitive justification for the 
choice of floorplan (c) is that it guarantees the separation between the primary 
modules and their spares for any size and shape of the defect clusters. This results 
in a higher yield, since it is less likely that the same cluster will hit both the module 
and its spare, thus killing the chip. 
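
The two yield expressions can be compared numerically with the sketch below. The
first function evaluates the yield when each module and its spare lie in
different defect blocks (Equation 8.41), and the second evaluates it when they
share a block (Equation 8.42). The function names and the parameter grid are
assumptions made for this illustration, and a numerical check over a few values
is of course not a proof.

```python
def yield_spare_in_other_block(lam1, lam2, alpha):
    """Equation 8.41: each block holds one module of each type."""
    a = (1 + lam1 / alpha) ** (-alpha)            # one type-1 module fault-free
    b = (1 + lam2 / alpha) ** (-alpha)            # one type-2 module fault-free
    c = (1 + (lam1 + lam2) / alpha) ** (-alpha)   # both modules of a block fault-free
    return 2 * c + 2 * a * b - 2 * c * a - 2 * c * b + c ** 2


def yield_spare_in_same_block(lam1, lam2, alpha):
    """Equation 8.42: each module shares a block with its own spare."""
    def pair(lam):
        # At least one of the two identical modules in the block is fault-free.
        return 2 * (1 + lam / alpha) ** (-alpha) - (1 + 2 * lam / alpha) ** (-alpha)
    return pair(lam1) * pair(lam2)


for alpha in (0.5, 2.0, 100.0):
    for lam1, lam2 in ((0.1, 0.1), (0.5, 0.5), (0.1, 2.0)):
        separated = yield_spare_in_other_block(lam1, lam2, alpha)
        together = yield_spare_in_same_block(lam1, lam2, alpha)
        print(alpha, lam1, lam2, round(separated, 4), round(together, 4))
```

In every case tried, the yield with the spare placed in a different block is at
least as high as the yield with the spare in the same block, and the two values
approach each other as α grows and the distribution tends to the Poisson limit.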

This last recommendation is exemplified by the design of the 3-D Computer, 
described in Subsection 8.4.3. The (2,4) structure that has been selected for
implementation in the 3-D Computer is shown in Figure 8.16a.

This floorplan has every spare unit adjacent to the four primary units that it 
can replace. This layout has short interconnection links between the spare and any 
primary unit that it may replace, and as a result, the performance degradation 
upon a failure of a primary unit is minimal. However, the close proximity of the 
spare and primary units results in a low yield in the presence of clustered faults, 
since a single fault cluster may cover a primary unit and all of its spares. 

Several alternative floorplans can be designed that place the spare farther apart
from the primary units connected to it (as recommended above). One such
floorplan is shown in Figure 8.16b. The projected yields of the 128 × 128 array using
the original floorplan (Figure 8.16a) or the alternative floorplan (Figure 8.16b) are
shown in Figure 8.17. The yield has been calculated using the medium-area
Negative Binomial distribution with a defect block size of two rows of primary units
(see Figure 8.16a). Figure 8.17 clearly shows that the alternative floorplan, in which
the spare unit is separated from the primary units that it can replace, has a higher
projected yield.

FIGURE 8.17 The yield of the original and alternate floorplans, depicted in Figure 8.16,
as a function of λ (α = 2). I. Koren and Z. Koren, "Defect Tolerant VLSI Circuits: Techniques and
Yield Analysis," Proceedings of the IEEE © 1998 IEEE.


8.5 Further Reading 

Several books (e.g., [8-10]), an edited collection of articles [5], and journal survey 
papers (e.g., [18,23,32,34,35,50]) have been devoted to the topic of this chapter. 
More specifically, for a detailed description of how critical areas and POFs can be
calculated, see Chapter 5 in [10] as well as [47,53]. Two geometrical methods
different from those mentioned in this chapter are the virtual artwork technique [33] and
the Voronoi diagram approach [39]. Parametric faults resulting from variations in 
process parameters are described in [7,45]. 

Triangular and exponential density functions have been proposed as
compounders in [36] and [42], respectively. The more commonly used (as a mixing
function) Gamma distribution has been suggested in [38] and [46]. The "window 
method" that is used to estimate the parameters of yield models is described in 
[23,38,40,42,48]. It has been extended in [26] to include estimation of the block size 
for the medium-area clustering yield model.

Designs of defect-tolerant memories are described in [12-14,51,55,56]. The use
of ECC is presented in [12]; the flexible chip boundaries scheme appears in [51]
and the memory design with redundant subarrays is described in [56]. Some of
these designs have been analyzed in [15,17,49]. Many techniques for assigning
spare rows and columns to defective rows and columns in memory arrays have
been developed; see, for example, [1,2,29,43]. Defect-tolerance techniques for logic
circuits have been proposed (and some implemented) in [3,21,22,25,30,31,37,54,
57]. The Hyeti microprocessor is described and analyzed in [31] and the 3-D
Computer is presented in [57].

The designers of many modern microprocessors have incorporated redundancy 
into the design of the embedded cache units. To determine the type of redundancy 
to be employed in the cache units of the PowerPC microprocessors, row, column, 
and subarray redundancies were compared considering the area and performance 
penalties and the expected yield improvements [52]. Based on their analysis, the
designers have decided to use row-only redundancy for the level-1 cache unit and
row and column redundancy for the level-2 cache unit. 

Intel's Pentium Pro processor incorporates redundancy in its 512-KByte level-2
cache [11]. This cache unit consists of 72 subarrays of 64-K memory cells each,
organized into four quadrants, and a single redundant subarray has been added
to every quadrant. The reported increase in yield due to the added redundancy is
35%. This design includes a BIST circuit that identifies the faulty cells, and
a flash memory circuit that is programmed to replace a defective subarray with a
spare subarray.

The two 64-KByte cache units in Hewlett-Packard's PA7300LC microprocessor
have been designed with redundant columns. Four spare columns are included in
a spare block that can be used to replace a faulty four-column block using multiplexers
that are controlled by programmable fuses. A BIST circuit is included to
test the cache unit and identify the faulty block [28].

The spare row and column assignment algorithm used in the self-repair circuit
for the embedded memory unit in an Alpha microprocessor is described in [1].

The effect of floorplanning on yield has been analyzed in [16,19]. 


8.6 Exercises 

1. Derive an expression for the critical area $A^{(c)}_{\text{miss}}(u)$ for square $u \times u$
missing-material defects in a conductor of length $L$ and width $w$. Assume that one side
of the defect is always parallel to the conductor, and that $L \gg w$ so that the
nonlinear edge effects can be ignored.

2. Use the polygon expansion technique to calculate approximately the critical
area for circular short-circuit defects of diameter 3 for the 14 x 7 layout consisting
of two conductors shown below:

[Figure not reproduced: 14 x 7 layout with two conductors.]

3. Find the average critical area $A^{(c)}_{\text{miss}}$ for circular missing-material defects in a
single conductor of length $L$ and width $w$ using the defect size distribution in
Equation 8.1 with $p = 3$. Assume that $L \gg w$ and ignore the nonlinear term in
Equation 8.6.

4. a. Derive an expression for the critical area $A^{(c)}_{\text{miss}}(x)$ of a circular
missing-material defect with diameter $x$ in the case of two conductors of length $L$,
width $w$, and separation $s$ (as shown in Figure 8.1). Ignore the nonlinear
terms and note that the expression differs for the three cases: $x < w$;
$w \le x \le 2w + s$; and $2w + s < x \le x_M$.

b. Find the average critical area $A^{(c)}_{\text{miss}}$ using the defect size distribution in
Equation 8.1 with $p = 3$. For simplicity, assume $x_M = \infty$.

5. A chip with an area of 0.2 cm$^2$ (and no redundancy) is currently manufactured.
This chip has a POF of $\theta = 0.6$ and an observed yield of $Y_1 = 0.87$.
The manufacturer plans to fabricate a similar but larger chip, with an area of
0.3 cm$^2$, using the same wafer fabrication equipment. Assume that there is
only one type of defect, and that the yield of both chips follows the Poisson
model $Y = e^{-\theta A_{\text{chip}} d}$ with the same POF $\theta$ and the same defect density $d$.

a. Calculate the defect density $d$ and the projected yield $Y_2$ of the second
chip.

b. Let the area of the second chip be a variable $A$. Draw the graph of $Y_2$, the
yield of the second chip, as a function of $A$ (for $A$ between 0 and 2).

6. A chip of area $A_{\text{chip}}$ (without redundancy, and with one type of defect) is currently
manufactured at a yield of $Y = 0.9$. The manufacturer is examining the
possibility of designing and fabricating two larger chips with areas of $2A_{\text{chip}}$
and $4A_{\text{chip}}$. The designs and layouts of the new chips will be similar to those of
the current chip (i.e., same $\theta$), and the defect density $d$ will remain the same.

a. Calculate the expected yields of the two new chips assuming a Poisson
model.

b. Calculate the expected yields of the two new chips assuming a Negative
Binomial model with $\alpha = 1.5$.

c. Discuss the difference between the results of a and b.

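7. For a chip without redundancy assume that $X$, the number of faults on the
chip, follows a compound Poisson distribution.

a. Use as a compounder the triangular density function

$$f_L(\ell) = \begin{cases} \ell/\lambda^2, & 0 \le \ell \le \lambda \\ (2\lambda - \ell)/\lambda^2, & \lambda \le \ell \le 2\lambda \end{cases}$$

and show that it results in the following expression for the chip yield:

$$Y_{\text{chip}} = \text{Prob}\{X = 0\} = \int_0^{2\lambda} e^{-\ell} f_L(\ell)\, d\ell = \left(\frac{1 - e^{-\lambda}}{\lambda}\right)^2 \qquad (8.43)$$

b. Now use as a compounder the exponential density function

$$f_L(\ell) = \frac{e^{-\ell/\lambda}}{\lambda}$$

and show that it results in

$$Y_{\text{chip}} = \text{Prob}\{X = 0\} = \int_0^{\infty} e^{-\ell} f_L(\ell)\, d\ell = \frac{1}{1 + \lambda} \qquad (8.44)$$

c. Compare the yield expressions in Equations 8.43 and 8.44 to those for the
Poisson and Negative Binomial models (for chips without redundancy) by
drawing the graph of the yield as a function of $\lambda$ for $0.001 \le \lambda \le 1.5$. For
the Negative Binomial model, use three values of $\alpha$, namely, $\alpha = 0.25$, 2,
and 5.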

8. Why does the spare row in Figure 8.5 include a fusible link?

9. To a memory array with four rows and eight columns, a single spare row and 
two spare columns have been added. The testing of the memory array has 
identified four defective cells indicated by an x in the diagram below: 


x 0 0 0 0 0 0 0
0 0 0 0 0 x 0 0
0 0 0 0 0 0 0 0
0 x 0 0 0 x 0 0

FIGURE 8.18 A 6 x 6 memory array with two spare rows, two spare columns, and nine
defective cells (marked by an x).


a. List two ways to reconfigure the memory array, i.e., which rows and 
columns will be disconnected and replaced by spares. 

b. Show a distribution of the four defective cells within the array for which 
the available spares will be insufficient. How many such distributions of 
four defective cells exist? 

c. Given that there are four defective cells and that they are randomly
distributed over the array, what is the probability of such an irreparable
distribution?

10. A 6 x 6 memory array with two spare rows and two spare columns is shown
in Figure 8.18. Show the corresponding bipartite graph, identify all the must-repair
rows and columns, and select additional rows/columns to cover the
remaining defective cells. Will the column-first (row-first) algorithm, if applied
after replacing the must-repair rows (must-repair columns), be able to
repair the memory array?

11. A chip consists of five modules, out of which four are needed for proper operation
and one is a spare. Suppose the fabrication process has a fault density
of 0.7 faults/cm$^2$, and the area of each module is 0.1 cm$^2$.

a. Calculate the expected yield of the chip using the Poisson model.

b. Calculate the expected yield of the chip using the Negative Binomial
model with $\alpha = 1$.

c. For each of the two models in parts (a) and (b), is the addition of the spare
module to the chip beneficial from the point of view of the effective yield?

d. Discuss the difference in the answer to (c) between the two models.


Fault Detection in Cryptographic Systems


Cryptographic algorithms are being applied in an increasing number of devices
to satisfy their high security requirements. Many of these devices require high-speed
operation and include specialized hardware encryption and/or decryption
circuits for the selected cryptographic algorithm. A unique characteristic of these
circuits is their very high sensitivity to faults. Unlike ordinary arithmetic/logic circuits
such as adders and multipliers, even a single data bit fault in an encryption or
decryption circuit will, in most cases, spread quickly and result in a totally scrambled
output (an almost random pattern). There is, therefore, a need to prevent such
faults or, at the minimum, be able to detect them.

There is another, even more compelling, reason for paying special attention to 
fault detection in cryptographic devices. The cryptographic algorithms (also called 
ciphers) that are being implemented are designed so that they are difficult to break. 
To obtain the secret key, which allows the decryption of encrypted information, an 
attacker must perform a prohibitively large number of experiments. However, it 
has been shown that by deliberately injecting faults into a cryptographic device 
and observing the corresponding outputs, the number of experiments needed to 
obtain the secret key can be drastically reduced. Thus, incorporating some form 
of fault detection into cryptographic devices is necessary for security purposes as 
well as for data integrity. 

We start this chapter with a brief overview of two important classes of ciphers,
namely, symmetric key and asymmetric (or public) key, and describe the fault injection
attacks that can be mounted against them. We then present techniques that
can be used to detect the injected faults in an attempt to foil the attacks.

9.1 Overview of Ciphers

Cryptographic algorithms use secret keys for encrypting the given data (known as
plaintext) thus generating a ciphertext, and for decrypting the ciphertext to reconstruct
the original plaintext. The keys used for the encryption and decryption steps
can be either identical (or trivially related), leading to what are known as symmetric
key ciphers, or different, leading to what are known as asymmetric key (or public key)
ciphers. Symmetric key ciphers have simpler, and therefore faster, encryption and
decryption processes compared with those of asymmetric key ciphers. The main
weakness of symmetric key ciphers is the shared secret key, which may be subject
to discovery by an adversary and must therefore be changed periodically. The
generation of new keys, commonly carried out using a pseudo-random-number
generator (see Section 10.4), must be very carefully executed because, unless properly
initialized, such generators may result in easy-to-discover keys. The new keys
must then be distributed securely, preferably by using a more secure (but also
more computationally intensive) asymmetric key cipher.

9.1.1 Symmetric Key Ciphers 

Symmetric key ciphers can be either block ciphers, which encrypt a block of a fixed 
number of plaintext bits at the same time, or stream ciphers, which encrypt 1 bit at 
a time. Block ciphers are more commonly used, and are therefore the focus of this 
chapter. 

Some well-known block ciphers include the Data Encryption Standard (DES)
and the more recent Advanced Encryption Standard (AES). DES uses 64-bit plaintext
blocks and a 56-bit key, whereas AES uses 128-bit blocks and keys of size
128, 192, or 256 bits. Longer secret keys are obviously more secure, but the
size of the data block also plays a role in the security of the cipher. For example,
smaller blocks may allow frequency-based attacks, such as relying on the higher
frequency of the letter "e" in English-language text.

Almost all symmetric key ciphers use the same key for encryption and for decryption.
The process used for encryption must be reversible so that the reverse
process followed during decryption can generate the original plaintext. The main
objective of the encryption process is to scramble the plaintext as much as possible.
This is done by repeating a computationally simple series of steps (called a round)
several times to achieve the desired scrambling.

The DES cipher follows the approach ascribed to Feistel. The Feistel scheme divides
the block of plaintext bits into two parts $B_1$ and $B_2$. $B_1$ is unchanged, whereas
the bits in $B_2$ are added (using modulo-2 addition, which is the logical bit-wise
Exclusive-OR (XOR) operation) to a one-way hash function $F(B_1, K)$, where $K$ is
the key. A hash function is a function that takes a long input string (in general,
of any length) and produces a fixed-length output string. A function is called a
one-way hash function if it is hard to reverse the process and find an input string
that will generate a given output value. The two subblocks $B_1$ and $B_2 + F(B_1, K)$
are then swapped.

These operations constitute a round, and the round is repeated several times.
Following a round, the first subblock holds $B_2 + F(B_1, K)$ and the second subblock
holds the original $B_1$. A single round is not secure since the bits of $B_1$ are
unchanged and were only moved, but repeating the round several times will
considerably scramble the original plaintext.

The one-way hash function $F$ may seem to prevent decryption. Still, by the end
of the round, both $B_1$ and the key $K$ are available and it is possible to recalculate
$F(B_1, K)$ and thus obtain $B_2$. Therefore, all the rounds can be "undone" in reverse
order to retrieve the plaintext.
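
The round and its inverse described above can be sketched in a few lines of Python. This is only an illustrative toy, not DES: the round function `F` below is a truncated SHA-256 digest chosen for convenience, and the names `feistel_round`, `feistel_round_inverse`, `encrypt`, and `decrypt` are hypothetical. The point is that decryption never has to invert `F`; it only recomputes it.

```python
import hashlib

def F(half: bytes, key: bytes) -> bytes:
    """Toy one-way round function (an assumption, not the DES function):
    SHA-256 of key || half, truncated to the length of the half block."""
    return hashlib.sha256(key + half).digest()[:len(half)]

def xor(a: bytes, b: bytes) -> bytes:
    """Bit-wise modulo-2 addition of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def feistel_round(b1: bytes, b2: bytes, key: bytes):
    # B1 is left unchanged, B2 is XORed with F(B1, K), then the halves swap.
    return xor(b2, F(b1, key)), b1

def feistel_round_inverse(b1: bytes, b2: bytes, key: bytes):
    # After a round the second half still holds the old B1,
    # so F(B1, K) can be recomputed and removed from the first half.
    old_b1 = b2
    old_b2 = xor(b1, F(old_b1, key))
    return old_b1, old_b2

def encrypt(block: bytes, round_keys) -> bytes:
    half = len(block) // 2
    b1, b2 = block[:half], block[half:]
    for k in round_keys:
        b1, b2 = feistel_round(b1, b2, k)
    return b1 + b2

def decrypt(block: bytes, round_keys) -> bytes:
    half = len(block) // 2
    b1, b2 = block[:half], block[half:]
    for k in reversed(round_keys):  # undo the rounds in reverse order
        b1, b2 = feistel_round_inverse(b1, b2, k)
    return b1 + b2
```

Calling `decrypt(encrypt(block, keys), keys)` with the same list of round keys returns the original block for any even-length block, regardless of how `F` is defined.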

DES was the first official standard cipher for commercial purposes. It became
a standard in 1976, and although there is currently a newer standard (AES,
established in 2002), the use of DES is still widespread either in its original form or
in its more secure variation called Triple DES. Triple DES applies DES three times
with different keys and offers as a result a higher level of security (one variation
uses three different keys for a total of 168 bits instead of 56 bits, while another
variation uses 112 bits).

The Feistel-function-based structure of DES is shown in Figure 9.1. It consists
of 16 identical rounds similar to the one described above. Each round first uses a
Feistel function (the F block in the figure), performs the modulo-2 addition (the
⊕ circle in the figure), and then swaps the two halves. In addition, DES includes
initial and final permutations (see Figure 9.1) that are inverses and cancel each
other. These do not provide any additional scrambling and were included to simplify
loading blocks of data in the original hardware implementation.

The 16 rounds use different 48-bit subkeys generated by a key schedule process 
shown in Figure 9.2. The original key has 64 bits, eight of which are parity bits, 
so the first step in the key schedule (the "Permuted Choice 1" in Figure 9.2) is to 
select 56 out of the 64 bits. The remaining 16 steps are similar: the 56 incoming bits 
are split into two 28-bit halves, and each half is rotated to the left by either one 
or two bits (specified for each step). Then, 24 bits from each half are selected by 
the "Permuted Choice 2" block to generate the 48-bit round subkey. As a result of 
the rotations, performed by the "<<<" block in the figure, a different set of bits is 
used in each subkey. 
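
The rotation part of the key schedule can be sketched as follows. This is a partial illustration only: the Permuted Choice 1 and Permuted Choice 2 bit-selection tables are omitted, the helper names `rotl28` and `rotated_halves` are hypothetical, and the per-step rotation amounts in `SHIFTS` are taken from the published DES standard rather than from the text above.

```python
# Left-rotation amount for each of the 16 key-schedule steps
# (values from the DES standard; the text only says "one or two bits").
SHIFTS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]

MASK28 = (1 << 28) - 1

def rotl28(half: int, n: int) -> int:
    """Rotate a 28-bit value left by n bits (the "<<<" block)."""
    return ((half << n) | (half >> (28 - n))) & MASK28

def rotated_halves(c0: int, d0: int):
    """Yield the pair of 28-bit halves that would feed Permuted Choice 2
    in each of the 16 steps, starting from the output of Permuted Choice 1."""
    c, d = c0, d0
    for n in SHIFTS:
        c, d = rotl28(c, n), rotl28(d, n)
        yield c, d
```

The rotation amounts add up to 28, so after the 16th step each half is back in its starting position, while every intermediate step presents a differently aligned set of bits to the selection stage.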

The particular Feistel (hash) function used in DES is shown in Figure 9.3. It
consists of four steps:

1. Expansion. The 32 input bits are expanded to 48, using an expansion permutation
that duplicates some of the bits.

2. Adding a key. The 48-bit result is added (addition modulo-2, which is a bit-wise
XOR operation) to a 48-bit subkey generated by the key schedule process.

3. Substitution. The 48-bit result of step 2 is divided into eight groups of 6 bits
each, which are then processed by substitution boxes (called SBoxes). An SBox
generates 4 bits according to a nonlinear transformation implemented as a
lookup table.

4. Permutation. The 32 bits generated by the eight SBoxes undergo a permutation.

FIGURE 9.1 The overall structure of the data encryption standard (DES).

FIGURE 9.2 The key schedule process for DES.

FIGURE 9.3 The Feistel function in DES.

Two crucial properties that every good cipher must have are called confusion and
diffusion. Confusion refers to establishing a complex relationship between the ciphertext
and the key, and diffusion implies that any natural redundancy that exists
in the plaintext (and can be exploited by an adversary) will dissipate in the ciphertext.
In DES, most of the confusion is provided by the SBoxes, and the expansion
and permutation provide the diffusion. If the confusion and diffusion are done correctly,
a single bit change in the plaintext will cause every bit of the ciphertext to
change with a probability of 0.5, independently of the others.

In 1999, a specially designed circuit was successful in breaking a DES key in less 
than 24 hours, demonstrating that the security provided by the 56-bit key is weak. 
Consequently, Triple DES has been declared as the preferred cipher and was itself 
later replaced in 2002 by AES, described next. 

AES does not use a Feistel function; instead, it is based on substitutions and
permutations, with most of its calculations being finite-field operations. AES uses
blocks of 128-bit plaintext and three possible key sizes of 128, 192, or 256 bits.
The 128-bit block is represented as a 4 x 4 array of bytes called the state, which is
denoted by $S$ with byte elements $s_{i,j}$ ($0 \le i, j \le 3$). The state $S$ is modified during
each encryption round, until the final ciphertext is produced. Each round of the
encryption process consists of four steps (see Figure 9.4):

FIGURE 9.4 The overall structure of the advanced encryption standard (AES).


TABLE 9-1 The advanced encryption standard (AES) SBox: substitution
values for the byte xy (in hexadecimal format)

        y
   x    0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
   0   63 7c 77 7b f2 6b 6f c5 30 01 67 2b fe d7 ab 76
   1   ca 82 c9 7d fa 59 47 f0 ad d4 a2 af 9c a4 72 c0
   2   b7 fd 93 26 36 3f f7 cc 34 a5 e5 f1 71 d8 31 15
   3   04 c7 23 c3 18 96 05 9a 07 12 80 e2 eb 27 b2 75
   4   09 83 2c 1a 1b 6e 5a a0 52 3b d6 b3 29 e3 2f 84
   5   53 d1 00 ed 20 fc b1 5b 6a cb be 39 4a 4c 58 cf
   6   d0 ef aa fb 43 4d 33 85 45 f9 02 7f 50 3c 9f a8
   7   51 a3 40 8f 92 9d 38 f5 bc b6 da 21 10 ff f3 d2
   8   cd 0c 13 ec 5f 97 44 17 c4 a7 7e 3d 64 5d 19 73
   9   60 81 4f dc 22 2a 90 88 46 ee b8 14 de 5e 0b db
   a   e0 32 3a 0a 49 06 24 5c c2 d3 ac 62 91 95 e4 79
   b   e7 c8 37 6d 8d d5 4e a9 6c 56 f4 ea 65 7a ae 08
   c   ba 78 25 2e 1c a6 b4 c6 e8 dd 74 1f 4b bd 8b 8a
   d   70 3e b5 66 48 03 f6 0e 61 35 57 b9 86 c1 1d 9e
   e   e1 f8 98 11 69 d9 8e 94 9b 1e 87 e9 ce 55 28 df
   f   8c a1 89 0d bf e6 42 68 41 99 2d 0f b0 54 bb 16


1. SubBytes. Each byte in the state matrix undergoes (independently of all other
bytes) a nonlinear substitution of the form $T(s_{i,j}^{-1})$. Due to the complexity of this
transformation, its 256 possible outcomes are (in almost all implementations of
AES) precomputed and stored in an SBox lookup table. Unlike in DES, this is
an 8- to 8-bit substitution (shown in Table 9-1) rather than a 6- to 4-bit one. The
AES SBox has been designed to resist simple attacks.

2. ShiftRows. The bytes of the first, second, third, and fourth rows of the state
matrix are rotated by 0, 1, 2, and 3 bytes, respectively. The state after this step
is

$$\begin{bmatrix} s_{0,0} & s_{0,1} & s_{0,2} & s_{0,3} \\ s_{1,1} & s_{1,2} & s_{1,3} & s_{1,0} \\ s_{2,2} & s_{2,3} & s_{2,0} & s_{2,1} \\ s_{3,3} & s_{3,0} & s_{3,1} & s_{3,2} \end{bmatrix} \qquad (9.1)$$

so that every column of the matrix is now composed of bytes from all columns
of the input matrix.


3. MixColumns. The four bytes in each column are used to generate four new
bytes through linear transformations, as follows ($j = 0, 1, 2, 3$):

$$\begin{aligned}
s'_{0,j} &= (\alpha \otimes s_{0,j}) \oplus (\beta \otimes s_{1,j}) \oplus s_{2,j} \oplus s_{3,j} \\
s'_{1,j} &= s_{0,j} \oplus (\alpha \otimes s_{1,j}) \oplus (\beta \otimes s_{2,j}) \oplus s_{3,j} \\
s'_{2,j} &= s_{0,j} \oplus s_{1,j} \oplus (\alpha \otimes s_{2,j}) \oplus (\beta \otimes s_{3,j}) \\
s'_{3,j} &= (\beta \otimes s_{0,j}) \oplus s_{1,j} \oplus s_{2,j} \oplus (\alpha \otimes s_{3,j})
\end{aligned} \qquad (9.2)$$

where $\alpha = x$ (or 02 in hexadecimal notation) and $\beta = x + 1$ (or 03 in hexadecimal
notation). $\otimes$ and $\oplus$ are the modulo-2 multiply and add operations, respectively,
of the polynomial representations of the state bytes and the $\alpha$ and $\beta$ coefficients.
These operations are performed modulo the irreducible generator polynomial
of AES, which is $g(x) = x^8 + x^4 + x^3 + x + 1$. Polynomial representations
of binary numbers and operations modulo a given generator polynomial have
been discussed in Section 3.1. The MixColumns step together with ShiftRows
provides the required diffusion in the AES cipher.

4. AddRoundKey. The round subkey is added (modulo-2) to the state. As in DES,
separate round subkeys are generated using a key schedule process.
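
The four steps above can be sketched in Python as follows. This is a minimal illustration, not an optimized or hardened implementation. The state is held as a list of four rows, and the function names are hypothetical. The SBox is computed rather than stored, on the assumption (taken from the AES standard, not from the text above) that $T$ is the standard's affine transformation applied to the field inverse.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes as polynomials modulo g(x) = x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:          # degree 8 reached: subtract (XOR) g(x)
            a ^= 0x11B
        b >>= 1
    return result

def gf_inv(a: int) -> int:
    """Inverse in GF(2^8), computed as a^254; 0 is mapped to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result if a else 0

def affine(b: int) -> int:
    """Affine transformation T of the AES standard (assumed form of T)."""
    r = b
    for shift in range(1, 5):
        r ^= ((b << shift) | (b >> (8 - shift))) & 0xFF
    return r ^ 0x63

SBOX = [affine(gf_inv(x)) for x in range(256)]

def sub_bytes(s):
    return [[SBOX[b] for b in row] for row in s]

def shift_rows(s):
    # Row i is rotated left by i bytes.
    return [row[i:] + row[:i] for i, row in enumerate(s)]

def mix_columns(s):
    out = [[0] * 4 for _ in range(4)]
    for j in range(4):
        col = [s[i][j] for i in range(4)]
        for i in range(4):
            # Equation 9.2: coefficients 02, 03, 01, 01 rotate with the row index.
            out[i][j] = (gf_mul(2, col[i]) ^ gf_mul(3, col[(i + 1) % 4])
                         ^ col[(i + 2) % 4] ^ col[(i + 3) % 4])
    return out

def add_round_key(s, k):
    return [[a ^ b for a, b in zip(rs, rk)] for rs, rk in zip(s, k)]

def aes_round(s, round_key):
    """One full round: SubBytes, ShiftRows, MixColumns, AddRoundKey."""
    return add_round_key(mix_columns(shift_rows(sub_bytes(s))), round_key)
```

Computing the SBox at start-up instead of hard-coding Table 9-1 keeps the sketch short; a practical implementation would normally store the 256-entry table.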


All four steps are performed in nine out of the 10 rounds of a 128-bit key implementation,
but in the 10th round, the MixColumns step is omitted. In addition,
prior to the first round, the first subkey is added to the original plaintext (see
Figure 9.4). The round subkeys are either generated on-the-fly following the key
schedule process shown in Figure 9.5 or are taken out of a lookup table that is filled
up every time a new key is established. The total number of rounds is increased to
12 and 14 for a 192-bit key and a 256-bit key AES, respectively.


■ EXAMPLE

A detailed example to illustrate the use of the AES algorithm (or for that matter,
any other symmetric key cipher such as DES) for even the smallest sizes of
its parameters (number of bits in the key and plaintext) would be tedious and not
very illuminating. We therefore present only some of the key steps of the example
that appears in full detail in the official AES document (see the Further
Reading section).

KeyExpansion(byte key[4 * Nk], word w[4 * (Nr + 1)], Nk)
begin
    word temp
    i = 0
    while (i < Nk)
        w[i] = word(key[4 * i], key[4 * i + 1], key[4 * i + 2], key[4 * i + 3])
        i = i + 1
    end while
    i = Nk
    while (i < 4 * (Nr + 1))
        temp = w[i - 1]
        if (i mod Nk = 0)
            temp = SubWord(RotWord(temp)) xor Rcon[i/Nk]
        else if (Nk > 6 and i mod Nk = 4)
            temp = SubWord(temp)
        end if
        w[i] = w[i - Nk] xor temp
        i = i + 1
    end while
end

FIGURE 9.5 The key schedule of AES (Nr = 10, 12, 14 is the number of rounds,
Nk = 4, 6, 8 is the number of 32-bit words in the key, and Rcon is an array of
round constants, Rcon[j] = (x^{j-1}, 00, 00, 00)).

(a) Initial state matrix
    32 88 31 e0
    43 5a 31 37
    f6 30 98 07
    a8 8d a2 34

(b) Key added in round 1
    2b 28 ab 09
    7e ae f7 cf
    15 d2 15 4f
    16 a6 88 3c

(c) State matrix, end of round 1
    19 a0 9a e9
    3d f4 c6 f8
    e3 e2 8d 48
    be 2b 2a 08

(d) After SubBytes
    d4 e0 b8 1e
    27 bf b4 41
    11 98 5d 52
    ae f1 e5 30

(e) After ShiftRows
    d4 e0 b8 1e
    bf b4 41 27
    5d 52 11 98
    30 ae f1 e5

(f) After MixColumns
    04 e0 48 28
    66 cb f8 06
    81 19 d3 26
    e5 9a 7a 4c

(g) The key added in round 2
    a0 88 23 2a
    fa 54 a3 6c
    fe 2c 39 76
    17 b1 39 05

(h) State matrix, end of round 2
    a4 68 6b 02
    9c 9f 5b 6a
    7f 35 ea 50
    f2 2b 43 49

(i) State matrix, end of round 10
    39 02 dc 19
    25 dc 11 6a
    84 09 85 0b
    1d fb 97 32

FIGURE 9.6 Example illustrating the AES algorithm.

Suppose the 128-bit plaintext is

32 43 f6 a8 88 5a 30 8d 31 31 98 a2 e0 37 07 34

and the 128-bit key is

2b 7e 15 16 28 ae d2 a6 ab f7 15 88 09 cf 4f 3c

Both have 32 hexadecimal digits and are shown in a matrix format in Figures
9.6a and b, respectively. The reader can verify that the byte-wise XOR
operation of these two matrices yields the state matrix at the end of round 1,
shown in Figure 9.6c.

The first step in round 2 is SubBytes and its results are shown in Figure 9.6d.
For example, the first byte in the state matrix was $s_{0,0} = 19$, and based on
the corresponding entry in Table 9-1, it is replaced by d4. The second step
is ShiftRows, and Figure 9.6e shows the results of rotating the first, second,
third, and fourth rows of the matrix by 0, 1, 2, and 3 bytes, respectively. The
next step is MixColumns, and its results are shown in Figure 9.6f. For example,
the first byte in the state matrix is calculated based on Equation 9.2 as follows:

$$\begin{aligned}
s'_{0,0} &= (\alpha \otimes s_{0,0}) \oplus (\beta \otimes s_{1,0}) \oplus s_{2,0} \oplus s_{3,0} \\
&= (02 \otimes d4) \oplus (03 \otimes bf) \oplus 5d \oplus 30 = 1a8 \oplus 1c1 \oplus 5d \oplus 30 = 04
\end{aligned}$$

Note that since the result is smaller than 100 ($x^8$ in polynomial notation),
there is no need to further reduce it modulo $g(x)$ (recall that $g(x) = x^8 + x^4 +
x^3 + x + 1$ is the generator polynomial of AES).

The situation is different when calculating the second byte in the first column.
Here,

$$\begin{aligned}
s'_{1,0} &= s_{0,0} \oplus (\alpha \otimes s_{1,0}) \oplus (\beta \otimes s_{2,0}) \oplus s_{3,0} \\
&= d4 \oplus (02 \otimes bf) \oplus (03 \otimes 5d) \oplus 30 = d4 \oplus 17e \oplus e7 \oplus 30 = 17d
\end{aligned}$$

This value must be reduced modulo $g(x)$, and since

$$x^8 \bmod g(x) = x^4 + x^3 + x + 1$$

we obtain

$$17d \bmod g(x) = 7d \oplus (x^4 + x^3 + x + 1) = 7d \oplus 1b = 66$$

which is the final value of the second byte in the first column in Figure 9.6f.

We now need to calculate a new round key using the procedure in Figure 9.5.
The original key is first rewritten as the following four words:

w[0] = 2b7e1516, w[1] = 28aed2a6
w[2] = abf71588, w[3] = 09cf4f3c

To calculate w[4] (the first column in the key matrix for round 2), we start with

temp = w[i - 1] = w[3] = 09cf4f3c

Then, we rotate this word by 1 byte obtaining cf4f3c09. Next, we substitute
each of the 4 bytes using the SubBytes transformation in Table 9-1, yielding
8a84eb01. We then perform a bit-wise XOR operation with

Rcon[1] = (x^{1-1}, 00, 00, 00) = 01000000

obtaining 8b84eb01. Finally, we calculate

w[i] = w[i - 4] xor temp = w[0] xor 8b84eb01
     = 2b7e1516 xor 8b84eb01 = a0fafe17

This is the first column in the key matrix in Figure 9.6g. Adding the resulting
key matrix to the state matrix, we obtain the new state matrix shown in Figure
9.6h. Continuing this process for the remaining rounds (recall that in the
last round the MixColumns step is skipped) results in the ciphertext

39 25 84 1d 02 dc 09 fb dc 11 85 97 19 6a 0b 32

as shown in Figure 9.6i.
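
The round-key derivation worked through above can be checked with a short Python transcription of the key schedule in Figure 9.5. It is a sketch under stated assumptions: words are held as 32-bit integers, the helper names are hypothetical, and the SBox is generated from the field inverse followed by the affine transformation of the AES standard instead of being copied from Table 9-1.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes modulo g(x) = x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def sbox_entry(x: int) -> int:
    """Field inverse (0 maps to 0) followed by the standard affine map."""
    inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
    r = inv
    for shift in range(1, 5):
        r ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return r ^ 0x63

SBOX = [sbox_entry(x) for x in range(256)]

def sub_word(w: int) -> int:
    """Apply the SBox to each of the four bytes of a word."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")

def rot_word(w: int) -> int:
    """Rotate a 32-bit word left by one byte."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF

def key_expansion(key: bytes) -> list:
    nk = len(key) // 4            # 4, 6, or 8 words in the key
    nr = nk + 6                   # 10, 12, or 14 rounds
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(nk)]
    rcon = 1                      # x^0; doubled in the field after each use
    for i in range(nk, 4 * (nr + 1)):
        temp = w[i - 1]
        if i % nk == 0:
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 2)
        elif nk > 6 and i % nk == 4:
            temp = sub_word(temp)
        w.append(w[i - nk] ^ temp)
    return w
```

For the key used in this example, `key_expansion(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))[4]` evaluates to `0xa0fafe17`, matching the first column of Figure 9.6g.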

If a single bit is changed in the plaintext, for example, instead of

32 43 f6 a8 88 5a 30 8d 31 31 98 a2 e0 37 07 34

we use

30 43 f6 a8 88 5a 30 8d 31 31 98 a2 e0 37 07 34

a very different ciphertext is obtained:

c0 06 27 d1 8b d9 e1 19 d5 17 6d be ba 73 37 c1

Similarly, if a single bit is changed in the key, for example, instead of

2b 7e 15 16 28 ae d2 a6 ab f7 15 88 09 cf 4f 3c

we use

2a 7e 15 16 28 ae d2 a6 ab f7 15 88 09 cf 4f 3c

the ciphertext produced is

c4 61 97 9e e4 4d e9 7a ba 52 34 8b 39 9d 7f 84

These two examples illustrate the fact that even a single-bit fault may result in
a totally scrambled (almost random) output, demonstrating the significance of
detecting such faults. ■


9.1.2 Public Key Ciphers

Unlike symmetric key ciphers, asymmetric key ciphers (also known as public key
ciphers) allow users to communicate securely without having access to a shared
secret key. Public key ciphers are, however, considerably more computationally
complex than symmetric key ciphers. Instead of a single key shared by the two
entities communicating with each other, the sender and recipient each have two
cryptographic keys called the public key and the private key. The private key is
kept secret, and the public key may be widely distributed. In a way, one of the two
keys can be used to "lock" a safe, whereas the other key is needed to unlock it. If a
sender encrypts a message using the recipient's public key, only the recipient can
decrypt it using the corresponding private key.

Another noteworthy application of public key ciphers is sender authentication: 
the sender encrypts a message with her own private key. By managing to decrypt 
the message using the sender's public key, the recipient is assured that the sender 
(and no one else) generated the message. 

The best-known public key cipher is the RSA algorithm, named after its three
inventors Rivest, Shamir, and Adleman, but other public key ciphers have been
developed and are in use. Person A wishing to use the RSA cipher must first generate
a secret private key and a public key. The latter will be distributed to everyone
who may wish to communicate with her. The key generation process consists
of the following steps:

1. Select two large prime numbers $p$ and $q$, and calculate their product $N = pq$.

2. Select a small odd integer $e$ that is relatively prime to

$$\varphi(N) = (p - 1)(q - 1)$$

Two numbers (not necessarily primes) are said to be relatively prime if their
only common factor is 1. For example, 6 and 25 are relatively prime, although
neither is a prime number.

3. Find the integer $d$ that satisfies

$$de \equiv 1 \bmod \varphi(N)$$

($d$ is often called the "inverse" of $e$).

The pair $(e, N)$ constitutes the public key, and A should broadcast it to everyone
who may wish to communicate with her. The pair $(d, N)$ will serve as A's secret
private key. The security provided by RSA depends on the difficulty of factoring
the large integer $N$ into its prime factors. Small integers can be factored in a reasonable
amount of time, allowing the secret private key to be easily derived from
the public key. To make the factoring time prohibitively large, each of the prime
numbers $p$ and $q$ must have at least a hundred digits.

Given a message $M$ that person B wishes to send to A, B will encrypt it using
A's public key as

$$S = M^e \bmod N$$

Note that this encryption scheme makes it necessary to restrict the message $M$ to

$$0 \le M \le N - 1$$

Upon receiving the encrypted message $S$, A will decrypt it using her private key
by calculating

$$S^d \bmod N = M^{de} \bmod N$$

which can be shown to be equal to the original plaintext message $M$. The encryption
and decryption of RSA messages thus entail exponentiations modulo $N$.

Although there are techniques for reducing the complexity of such modular 
exponentiation (e.g., Montgomery reduction), the complexity of encryption and 
decryption for the RSA cipher is still considerably higher than that for symmetric 
key ciphers. 


■ EXAMPLE

To illustrate the use of the RSA algorithm, consider the following simple
example. Suppose we select the prime numbers $p = 7$ and $q = 11$, yielding
$N = 77$ and $\varphi(N) = 60$. We can then select $e = 7$, which is obviously relatively
prime with respect to $\varphi(N)$. The pair $(e, N) = (7, 77)$ constitutes our public
key. We search now for $d$ that satisfies $7d \equiv 1 \bmod 60$, and find $d = 43$
(since $7 \cdot 43 = 301 \equiv 1 \bmod 60$). Suppose now that B wishes to send us the
message $M = 9$. B encrypts it using the public key $(e, N) = (7, 77)$, which we
have given him, obtaining $9^7 \bmod 77 = 4782969 \bmod 77 = 37$. We receive 37
and decrypt it using our private key by calculating $37^{43} \bmod 77$, revealing the
plaintext 9. ■
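
The numbers in this example can be reproduced with a few lines of Python. The sketch below is for illustration only: the function names `rsa_keygen` and `rsa_apply` are hypothetical, the primes are the toy values from the example, and nothing here (no padding, tiny modulus) is suitable for real use. The three-argument `pow` with a negative exponent requires Python 3.8 or later.

```python
from math import gcd

def rsa_keygen(p: int, q: int, e: int):
    """Return ((e, N), (d, N)) for primes p, q and public exponent e."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("e must be relatively prime to phi(N)")
    d = pow(e, -1, phi)        # modular inverse of e modulo phi(N)
    return (e, n), (d, n)

def rsa_apply(message: int, key) -> int:
    """Modular exponentiation with either the public or the private key."""
    exponent, n = key
    if not 0 <= message <= n - 1:
        raise ValueError("message must lie in the range 0..N-1")
    return pow(message, exponent, n)

public_key, private_key = rsa_keygen(7, 11, 7)
ciphertext = rsa_apply(9, public_key)            # B encrypts M = 9
recovered = rsa_apply(ciphertext, private_key)   # A decrypts
```

With these inputs the private exponent is 43, the ciphertext is 37, and decryption returns 9, as in the example.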


9.2 Security Attacks Through Fault Injection 

The level of security provided by the different ciphers has not been proved in an
absolute sense, and all ciphers rely on the difficulty of finding the secret key directly
and having to resort to exhaustive searches which may take a prohibitive
amount of time. However, attacks on cryptographic systems have been developed
which take advantage of side-channel information. This is information that can
be obtained from the physical implementation of a cipher rather than through exploitation
of some weakness of the cipher itself. One example of such side-channel
information is the time needed to perform an encryption (or decryption), which in
certain implementations may depend on the bits of the key. This allows the attacker
to narrow down the range of values which need to be attempted. Another
example is the amount of power consumed in various steps of the encryption
process: the power consumption profile of certain implementations may depend
on whether the bits of the key are 0 or 1.

Schemes to protect cryptosystems against such attacks have been developed.
For example, a random number of instructions that do not perform any useful
calculation can be injected into the code, scrambling the relationship between
the bits in the key and the total time needed to complete the encryption (or
decryption). These randomly injected instructions can also help protect against
attacks based on power measurements. Other countermeasures that have been
followed include designs that have a data-independent delay or use dual-rail
logic that consumes the same power independently of whether a particular bit is
1 or 0. Most such techniques incur delay and/or power penalties.

An important type of side-channel attack, which is of particular interest to us
in this book, relies on the intentional injection of faults into a hardware
implementation of a cipher. Such attacks have proved to be both easy to apply
and very efficient; an attacker can guess the secret key after a very small
number of fault injection experiments. This has been shown to be true for many
types of ciphers, both symmetric and asymmetric.

The different techniques for injecting intentional faults into a cryptographic
device include varying the supply voltage (generating a spike), varying the
clock frequency (generating a glitch), overheating the device, or, as is more
commonly done, exposing the device to intense light using either a camera flash
or a more precise laser (or X-ray) beam.

Injecting a fault through a voltage spike or a clock glitch is likely to render
a complete byte (or even several bytes) faulty, whereas the more precise laser
or X-ray beams may be successful in inducing a single-bit fault. Fault-based
attacks have been developed for both cases, and since most of these attacks
induce transient faults, they allow the attacker to repeat her attempts multiple
times until sufficient information is collected for extracting the secret key,
and even to use the device after breaking the cipher.

A practical issue that must be considered when mounting a fault-based attack
is the need for precise timing of the fault injection. To achieve the desired
effect, the fault must be injected during a particular step of the encryption
or decryption algorithm. This turns out to be achievable in practice by
analyzing the power and/or electromagnetic profile of the cryptographic device.

We next describe briefly possible fault attacks on symmetric and asymmetric 
key ciphers. 

9.2.1 Fault Attacks on Symmetric Key Ciphers 

Various fault-injection-based attacks on DES have been described, two of which
are presented next.

TABLE 9-2 Fault attack on data encryption standard (DES)

DES Key                           Output
K0 = xx xx xx xx xx xx xx xx      S0
K1 = xx xx xx xx xx xx xx 00      S1
K2 = xx xx xx xx xx xx 00 00      S2
K3 = xx xx xx xx xx 00 00 00      S3
K4 = xx xx xx xx 00 00 00 00      S4
K5 = xx xx xx 00 00 00 00 00      S5
K6 = xx xx 00 00 00 00 00 00      S6
K7 = xx 00 00 00 00 00 00 00      S7


In cryptographic devices that use DES (e.g., smart cards), the secret key is
often stored in an EEPROM and then transferred to the memory when a message
needs to be encrypted or decrypted. If the attacker can reset an entire byte of
the key (set the eight bits of that byte to zero) during its transfer from the
EEPROM to the memory, he can figure out the secret key. The attack consists of
eight steps as outlined in Table 9-2. In all of these experiments, known (to the
attacker) plaintext messages are encrypted with a different number of bytes of
the key being forced to 0 as shown in Table 9-2. Based on the ciphertext S7, the
attacker can derive the first byte of the secret key by trying out all possible
values of the first byte until the value that would produce S7 is found. Since
in DES each byte of the key includes a parity bit, at most 128 values need to be
checked rather than 256. In a similar manner, the second byte of the key can be
found based on S6. This procedure is continued until all eight bytes of the
secret key are discovered.
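The byte-by-byte search can be sketched in Python as follows. This is a sketch
under stated assumptions, not an attack on real DES: `toy_encrypt` is a keyed
stand-in built from SHA-256 (any keyed function shows the search structure),
and `faulty_device` is a hypothetical model of a device in which the last
`zeroed` key bytes are reset to 0 during the transfer from the EEPROM.

```python
import hashlib

def odd_parity(b7):
    """Append a parity bit to a 7-bit value so the byte has odd parity (as in DES keys)."""
    ones = bin(b7).count("1")
    return (b7 << 1) | (0 if ones % 2 == 1 else 1)

def toy_encrypt(key, plaintext):
    """Stand-in for DES (NOT DES): a keyed function used only for illustration."""
    return hashlib.sha256(bytes(key) + plaintext).digest()[:8]

def faulty_device(secret_key, zeroed, plaintext):
    """Hypothetical device model: the last `zeroed` key bytes are forced to 0."""
    key = list(secret_key)
    for i in range(8 - zeroed, 8):
        key[i] = 0
    return toy_encrypt(key, plaintext)

def recover_key(device, plaintext):
    known = []                                   # key bytes recovered so far
    for zeroed in range(7, -1, -1):              # uses S7, S6, ..., S0 in turn
        target = device(zeroed, plaintext)
        for b7 in range(128):                    # the parity bit leaves 128 candidates
            guess = known + [odd_parity(b7)] + [0] * zeroed
            if toy_encrypt(guess, plaintext) == target:
                known.append(odd_parity(b7))
                break
        else:
            raise ValueError("no candidate matched; the fault model may not hold")
    return known

secret = [odd_parity(v) for v in (0x13, 0x34, 0x57, 0x79, 0x1B, 0x3C, 0x5F, 0x71)]
plaintext = b"known-pt"
found = recover_key(lambda z, pt: faulty_device(secret, z, pt), plaintext)
print(found == secret)
```

The search needs at most 8 x 128 = 1024 trial encryptions, compared with an
exhaustive search over the whole key space.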

A second fault-based attack relies on causing an instruction to fail (most
commonly using clock glitches). For example, if the loop variable controlling
the number of times the basic round is executed is corrupted and, as a result,
only one or two rounds are executed, the task of finding the secret key is
greatly simplified.

This type of attack can also be mounted against a device that uses AES and 
implements the cipher via software. Fault injection attacks on AES that focus, for 
example, on a byte of either the round subkey or on the state in the last round of 
the encryption have also been developed. Some of these attacks have been applied 
in practice to smart cards, yielding the secret key after fewer than 300 experiments. 
References to the descriptions of these attacks are provided in the Further Reading 
section. 

9.2.2 Fault Attacks on Public (Asymmetric) Key Ciphers

Unlike symmetric key ciphers for which both encryption and decryption processes
are vulnerable to security attacks, for a public key cipher, only the decryption
process may be subject to attacks attempting to extract the secret private key.
One easily understood fault attack on the RSA decryption process assumes that
the attacker can flip a randomly selected single bit of the private key d. Given
an encrypted message S and its corresponding plaintext M, both of which are
known to the attacker, he flips a random bit of d. If the ith bit of d, $d_i$,
is flipped to produce its complement $\bar{d}_i$, the decryption device will
generate an erroneous plaintext $\hat{M}$ instead of M. The ratio between these
two is

$$\frac{\hat{M}}{M} = \frac{S^{2^i \bar{d}_i}}{S^{2^i d_i}} \bmod N$$

If this ratio is equal to $S^{2^i} \bmod N$ for some i, the attacker can
conclude that $d_i = 0$. A ratio of $1/S^{2^i} \bmod N$ for some i implies that
$d_i = 1$. Repeating this process will eventually provide all the bits of the
secret private key d.

In a similar way, the bits of d can be obtained by flipping a bit in the
ciphertext S, and even by flipping two (or more) bits simultaneously. Showing
this is left as an exercise for the reader. This type of attack can, therefore,
be successful even if the attacker is unable to precisely flip a single bit.


■ EXAMPLE 

Let us use the example discussed in Section 9.1.2 with (e, N) = (7, 77) as the
public key and d = 43 (or in binary, 101011) as the private key.

Suppose the decryption device receives the ciphertext 37 and produces the
plaintext M = 9 if no fault is injected, and the erroneous text $\hat{M} = 67$
if a single-bit fault is injected into d. We now search for i such that
$9 = (67 \cdot 37^{2^i}) \bmod 77$. It is easy to verify that among the possible
values of i, i = 3 is the one because

$$(67 \cdot 37^{8}) \bmod 77 = (67 \cdot 53) \bmod 77 = 9$$

Consequently, we deduce that $d_3 = 1$. ■
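The search in this example can be written as a short Python sketch. The
function name `locate_flipped_bit` and the way the fault is simulated (flipping
one bit of d before exponentiation) are our own illustrative choices. To avoid
modular division, the two cases of the ratio are tested in multiplied-out form.

```python
def locate_flipped_bit(M, M_hat, S, N, nbits):
    """Return every (i, d_i) consistent with a single flipped bit of d.

    d_i = 0: the flip adds 2^i to d, so M_hat = M * S^(2^i) mod N.
    d_i = 1: the flip subtracts 2^i, so M = M_hat * S^(2^i) mod N.
    """
    candidates = []
    for i in range(nbits):
        factor = pow(S, 1 << i, N)            # S^(2^i) mod N
        if M_hat == (M * factor) % N:
            candidates.append((i, 0))
        if M == (M_hat * factor) % N:
            candidates.append((i, 1))
    return candidates

N, d, S = 77, 43, 37
M = pow(S, d, N)                              # fault-free plaintext
M_hat = pow(S, d ^ (1 << 3), N)               # simulate a flip of bit 3 of d
print(M, M_hat, locate_flipped_bit(M, M_hat, S, N, nbits=6))
# 9 67 [(3, 1)]
```

The function returns all matching candidates rather than the first one because,
with a modulus as small as 77, the values $S^{2^i} \bmod N$ repeat (here
$37^{16} \bmod 77 = 37$), so a flip of some other bit may match more than one
position.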


9.3 Countermeasures

We presented above only a small sample out of the large number of possible
fault-based attacks that can be mounted against cryptographic devices. Due to
the relative ease of applying these attacks, it is obvious that proper
countermeasures must be taken in order to keep the devices secure. Any such
countermeasure must first detect the fault, and then prevent the attacker from
observing the output of the device after the fault has been injected. Either the
output could be blocked (by producing a constant value such as all zeroes) or a
random result generated, misleading the attacker. Clearly, the original design
of the device must be modified to include any such countermeasure.

Two approaches can be followed when modifying the design of a cryptographic
device to protect it against fault injection-based attacks. One relies on
duplicating the encryption or decryption process (using either hardware or time
redundancy) and comparing the two results. This approach assumes that the
injected faults are transient and will not manifest themselves at exactly the
same time in the two calculations. This approach is easy to apply but may, in
certain situations, impose an overhead too high to be practical. The second
approach is based on error-detection codes (see Section 3.1) that usually
require a smaller overhead compared with brute-force duplication, although
possibly at the cost of a lower fault coverage. Thus, a trade-off between the
fault coverage and the hardware and/or time overhead should be expected.

9.3.1 Spatial and Temporal Duplication 

Applying duplication to the encryption (or decryption) procedure is quite
straightforward. Spatial duplication requires redundant hardware to allow
independent calculations so that faults injected into one hardware unit do not
affect (in the same way) the other unit(s). Temporal redundancy can be applied
by reusing the same hardware unit or re-executing the same software program,
assuming that the manifestation of the injected faults will change from one
execution to the other. These schemes are similar to the conventional hardware
and time redundancy techniques that are described in Chapter 2. The
recalculation with shifted/modified operands techniques that have been described
in Section 5.2.4 can be used here to prevent the possibility of both
computations being affected by the injected fault in exactly the same way.

A different scheme for applying duplication relies on having a separate
hardware unit or software program for executing the reverse procedure. For
example, after completing the encryption, the decryption unit or program is
applied to the ciphertext, and only if the result of the decryption is equal to
the original plaintext is the ciphertext considered fault-free and is output.

The latter approach is costly if applied to an RSA decryption device. The
decrypted result $\hat{M}$ obtained from the received encrypted message S is
verified by calculating $\hat{S} = \hat{M}^e \bmod N$ and comparing $\hat{S}$
to S. This calculation is time-consuming if the public key e is very large.
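A minimal Python sketch of this reverse-procedure check for RSA decryption is
given below. The function `checked_decrypt` and its `fault` hook are
hypothetical: the hook corrupts the computed plaintext and merely stands in for
a fault injected somewhere during the exponentiation.

```python
def checked_decrypt(S, d, e, N, fault=None):
    """Decrypt S, re-encrypt the result, and release it only if it matches S."""
    M_hat = pow(S, d, N)
    if fault is not None:
        M_hat = fault(M_hat)          # hypothetical injected fault
    S_hat = pow(M_hat, e, N)          # reverse procedure: S_hat = M_hat^e mod N
    if S_hat != S:
        return None                   # block the output
    return M_hat

print(checked_decrypt(37, 43, 7, 77))                          # 9
print(checked_decrypt(37, 43, 7, 77, fault=lambda m: m ^ 1))   # None
```

Blocking the output is one of the two responses mentioned earlier; returning a
random value instead would be an equally simple change to this sketch.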

9.3.2 Error-Detecting Codes 

This section illustrates the use of error-detecting codes (EDCs) for detecting
faults in the encryption process of symmetric key ciphers. Similar rules apply
to using EDCs during the decryption and key schedule procedures, because these
use the same basic mathematical operations as the encryption.

FIGURE 9.7 The general structure for detecting faults in encryption devices
using error-detecting codes. (Figure not reproduced; its labels are: input
text, intermediate or final ciphertext, predicted check bits, and error.)


When using an EDC during the encryption process, check bits are first generated
for the input plaintext; then, for each operation that the data bits undergo,
the check bits of the expected result are predicted. Periodically, check bits
for the actual result are generated and compared with the predicted check bits,
and a fault is detected if the two sets do not match. The general approach is
depicted in Figure 9.7. The validation checks can be scheduled at various
granularities of the encryption, be it after every operation applied to the
data, after each round, or only once at the end of the encryption process.

The first step, that of generating the check bits for the plaintext, is
straightforward. The difficult part is devising the prediction rules for the
new values of the check bits after each transformation that the data bits
undergo during the encryption process. The complexity of these prediction
rules, combined with the frequency at which the comparison is made, determines
the overhead of applying the EDC, instead of duplication, as a protection
against fault attacks.

Various EDCs have been proposed for symmetric and public-key ciphers, most 
of them being the traditional EDCs described in Chapter 3. In particular, parity- 
based EDCs were found to be effective for the DES and AES symmetric ciphers. 
Parity bits can be associated with entire 32-bit words, with individual bytes, or 
even with nibbles (sets of 4 bits), with each such scheme providing a different 
fault coverage and entailing a different overhead in terms of extra hardware and 
delay. 

As an example, we illustrate the procedure for developing parity prediction
rules when using a parity-based EDC for the AES cipher. Since most data
transformations performed in the AES cipher operate on bytes, the natural
choice is assigning a parity bit to each byte of the state. This will simplify
the prediction rules and provide a high fault coverage. We discuss next the
prediction rules for the four steps included in each round.

The prediction of the output parity bits for the ShiftRows transformation is
straightforward: it is a rotated version of the input parity bits, following
Equation 9.1.

Equally simple is the prediction of the output parity bits of the AddRoundKey 
step: it consists of adding modulo-2 the input parity matrix associated with the 
state to the parity matrix associated with the current round key. 
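These two prediction rules can be checked numerically. The following Python
sketch assumes an even-parity convention (the parity bit is the XOR of the
eight data bits) and stores the state as four rows of four bytes; the helper
names are ours.

```python
import random

def parity(byte):
    """Even-parity bit of a byte: XOR of its eight bits."""
    return bin(byte).count("1") & 1

def shift_rows(m):
    """Rotate row i of a 4x4 matrix left by i positions (AES ShiftRows)."""
    return [row[i:] + row[:i] for i, row in enumerate(m)]

def add_round_key(state, key):
    """Element-wise XOR of two 4x4 matrices (AES AddRoundKey)."""
    return [[s ^ k for s, k in zip(srow, krow)] for srow, krow in zip(state, key)]

def parity_matrix(m):
    return [[parity(b) for b in row] for row in m]

rng = random.Random(1)
state = [[rng.randrange(256) for _ in range(4)] for _ in range(4)]
round_key = [[rng.randrange(256) for _ in range(4)] for _ in range(4)]
P, PK = parity_matrix(state), parity_matrix(round_key)

# ShiftRows: the predicted parities are the input parities, rotated the same way.
pred_shift = shift_rows(P)
# AddRoundKey: the predicted parities are the modulo-2 sum of the two parity matrices.
pred_ark = add_round_key(P, PK)

print(pred_shift == parity_matrix(shift_rows(state)))                 # True
print(pred_ark == parity_matrix(add_round_key(state, round_key)))     # True
```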

The SubBytes step uses SBox lookup tables where each SBox is usually
implemented as a 256x8-bit memory. The input to the SBox will already have an
associated parity bit. To generate the outgoing parity, a parity bit can be
stored with each data byte, increasing the number of bits in each location in
the SBox to 9. To make sure that input parity errors are not discarded, we will
have to check the parity of the input data and, if an error is detected, stop
the encryption process. This would add hardware overhead (parity checkers for
16 bytes) and extra delay.

A better choice would be to propagate the input parity errors so that they can be 
detected later on. This can be achieved by including the incoming parity bit when 
addressing the SBox, thus further increasing the table size to 512x9. The entries 
that correspond to input bytes with correct parity will include the appropriate 
SubBytes transformation result, with a correct parity bit. The other entries will 
contain a deliberately incorrect result, such as an all-zeroes byte with an incorrect 
parity bit. 
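The 512x9 table can be sketched in Python as follows. The SBox values are
computed from the AES definition (inverse in GF(2^8) followed by the affine
map) rather than typed in; the even-parity convention, the dictionary layout,
and the helper names are our assumptions for illustration.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return r

def sbox(x):
    """AES SubBytes value: multiplicative inverse followed by the affine map."""
    inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
    s = inv
    for k in range(1, 5):
        s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF   # rotate left by k
    return s ^ 0x63

def parity(byte):
    """Even-parity bit of a byte: XOR of its eight bits."""
    return bin(byte).count("1") & 1

# Address = (incoming parity bit, data byte); entry = (output byte, output parity bit).
TABLE = {}
for byte in range(256):
    out = sbox(byte)
    TABLE[(parity(byte), byte)] = (out, parity(out))   # valid input: correct result
    TABLE[(parity(byte) ^ 1, byte)] = (0x00, 1)        # invalid input: 00 with wrong parity

def lookup(byte, p):
    return TABLE[(p, byte)]

print(lookup(0x53, parity(0x53)))        # (237, 0), i.e., SBox(0x53) = 0xED
print(lookup(0x53, parity(0x53) ^ 1))    # (0, 1): the parity error is propagated
```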

If fault attacks on the SBox address decoder can be expected, the above scheme
is insufficient. In this case, we can add a small and separate table of size
256x1, which will include the predicted parity bit for the correct output byte.
This separate table will only allow detection of a mismatch between the parity
bit of the correct output byte and the parity bit of the incorrect (but with a
valid parity) output byte. We can increase the detection capabilities of this
scheme by adding one (or more) correct output data bits to each location in the
small table, thus increasing its size. Comparing the output of this table to
the appropriate output bits of the main SBox table allows the detection of most
addressing circuitry faults.

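The output parity bits of the MixColumns step are the most complex to predict.
As the reader is requested to verify in the Exercises, the equations for
predicting these parity bits are as follows:

$$
\begin{aligned}
\hat{p}_{0,j} &= p_{0,j} \oplus p_{2,j} \oplus p_{3,j} \oplus s^{(7)}_{0,j} \oplus s^{(7)}_{1,j} \\
\hat{p}_{1,j} &= p_{0,j} \oplus p_{1,j} \oplus p_{3,j} \oplus s^{(7)}_{1,j} \oplus s^{(7)}_{2,j} \\
\hat{p}_{2,j} &= p_{0,j} \oplus p_{1,j} \oplus p_{2,j} \oplus s^{(7)}_{2,j} \oplus s^{(7)}_{3,j} \\
\hat{p}_{3,j} &= p_{1,j} \oplus p_{2,j} \oplus p_{3,j} \oplus s^{(7)}_{3,j} \oplus s^{(7)}_{0,j}
\end{aligned}
\tag{9.3}
$$

where $p_{i,j}$ is the parity bit associated with state byte $s_{i,j}$,
$\hat{p}_{i,j}$ is the predicted parity bit of the corresponding output byte,
and $s^{(7)}_{i,j}$ is the most significant bit of $s_{i,j}$.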
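The prediction rules of Equation 9.3 can be checked against a direct MixColumns
computation on random columns. The Python sketch below assumes even parity and
treats one column $[s_0, s_1, s_2, s_3]$ at a time; the function names are
ours. The rules follow from the fact that multiplying a byte by {02} changes
its parity exactly when its most significant bit is 1, since the reduction
constant 0x1B has an even number of ones.

```python
import random

def parity(byte):
    """Even-parity bit of a byte: XOR of its eight bits."""
    return bin(byte).count("1") & 1

def xtime(b):
    """Multiply by {02} in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) if b & 0x100 else b

def mix_column(s):
    """AES MixColumns on one column: row i is 2*s[i] + 3*s[i+1] + s[i+2] + s[i+3]."""
    return [xtime(s[i]) ^ (xtime(s[(i + 1) % 4]) ^ s[(i + 1) % 4])
            ^ s[(i + 2) % 4] ^ s[(i + 3) % 4] for i in range(4)]

def predict_parity(s, p):
    """Equation 9.3: output parities from input parities and most significant bits."""
    msb = [b >> 7 for b in s]
    return [p[i] ^ p[(i + 2) % 4] ^ p[(i + 3) % 4] ^ msb[i] ^ msb[(i + 1) % 4]
            for i in range(4)]

rng = random.Random(0)
all_match = True
for _ in range(10000):
    col = [rng.randrange(256) for _ in range(4)]
    predicted = predict_parity(col, [parity(b) for b in col])
    actual = [parity(b) for b in mix_column(col)]
    all_match = all_match and (predicted == actual)

print(all_match)   # True
```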
The question that remains is the granularity at which the comparisons between
the generated and predicted parity bits will be made. Scheduling one validation
check at the end of the whole encryption process has the obvious advantage of
having the lowest overhead in terms of hardware and extra delay. Theoretically,
this could result in the error indication being masked during the encryption
procedure, yielding a match between the generated and predicted parity bits
despite the ciphertext being erroneous. It can be shown, however, that errors
injected at any step of the AES encryption procedure will not be masked, and
therefore, a single validation check of the final ciphertext is sufficient for
error-detection purposes.

Still, not every combination of errors can be detected by this scheme.
Parity-based EDCs are capable of detecting any fault that consists of an odd
number of bit errors; an even number of bit errors occurring in a single byte
will not be detected. Moreover, if errors are injected in both the state and
the round key, some data faults of odd cardinality will not be detected, for
example, a single bit error in the round key and a single bit error in the
state, occurring in matching bytes which are added in the AddRoundKey step. The
reason we do not restrict our discussion to single bit error coverage (as is
usually done when benign faults are considered) is that when a malicious fault
injection attack takes place, it most likely impacts multiple adjacent bits of
the state and/or round key. Still, although we cannot expect a 100% fault
coverage when using a parity-based EDC, the fault coverage has been shown to be
very high, even when multiple faults are considered.

Parity-based EDCs are suitable for the DES cipher as well, but the situation
here is different from that with AES, due to two of the internal operations in
the DES encryption process, namely, the expansion (from 32 to 48 bits) and the
permutation of the 32 bits. The latter permutation is irregular, and therefore,
there is no simple way to predict the individual parity bits of the four bytes.
A more practical solution is to verify the correctness of the permutation by
duplicating the circuit and comparing the results. In addition, if we wish to
detect faults in the remaining steps of the encryption using a parity-based
EDC, we must schedule a validation checkpoint within each round prior to the
permutation and generate new parity bits afterward. A simple way to overcome
the complexity of parity prediction for the 32-bit permutation is to use a
single parity bit per 32-bit word. This, however, yields a very low fault
coverage and is not recommended.

In a similar way, EDCs can be developed for other symmetric key ciphers.
Several such ciphers that rely on modular addition and multiplication will
better match residue codes (see Chapter 3). Other symmetric ciphers have been
shown to require a very expensive implementation of EDCs, leading to the
conclusion that the brute-force duplication is probably a more suitable
solution. The cost of providing protection against fault-based attacks should
be taken into account when selecting a cipher for a device.

The RSA public key cipher is based on modular arithmetic operations, and as
such, it suggests the residue code as a natural choice. First, the check bits
for the plaintext are generated based on the selected modulus C for the residue
check (M mod C, where M is the original message). Since all operations
performed during the RSA encryption (and decryption) are modular ones, we can
apply them to the input check bits and obtain the predicted output check bits.
The residue check will fail to detect an error if the faulty ciphertext has the
same residue check bits as the correct one. Assuming that the fault injected is
random, this match will happen with a probability of 1/C, and thus, a higher
value of C will result in a higher fault coverage (but also a higher overhead).

Decryption_Algorithm_1(S, N, (d_{n-1}, d_{n-2}, ..., d_0))
begin
    a = S
    for i from n - 2 to 0 do
        a = a^2 mod N
        if d_i = 1 then a = S · a mod N
    end
    M = a
end

FIGURE 9.8 A straightforward decryption algorithm for RSA.


9.3.3 Are These Countermeasures Sufficient? 

The objective of the countermeasures described above is to detect any fault
injected during the process of encryption or decryption, and when such faults
are detected, prevent the transmission of the erroneous results that may assist
the attacker in extracting the secret key. Unfortunately, it has been
demonstrated that although the detection of faults is necessary, it is not
always sufficient for protecting against fault-based attacks. We illustrate
this point through two examples: an RSA decryption and an AES encryption.

Suppose we use for the RSA decryption a straightforward algorithm that consists
of raising the input S to the power d (where d is the private key) as shown in
Figure 9.8. The inputs to this algorithm are the encrypted message S, the
modulus N, and the n-bit private key $d = (d_{n-1}, d_{n-2}, \ldots, d_0)$.


■ EXAMPLE 

Assume a 4-bit private key $(d_3, d_2, d_1, d_0) = (1011)$ (the decimal 11). The
algorithm in Figure 9.8 will calculate
$M = ((S^2)^2 \cdot S)^2 \cdot S = S^{11}$. ■
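The algorithm of Figure 9.8 translates almost line by line into Python. In the
sketch below, the key is passed as a list of bits with the most significant bit
first (assumed to be 1), and the extra counter of conditional multiplications
is our addition for illustration; the parameters S = 37 and N = 77 are borrowed
from the earlier RSA example.

```python
def decrypt_square_and_multiply(S, N, d_bits):
    """Figure 9.8: d_bits = [d_{n-1}, ..., d_0], with d_{n-1} = 1."""
    a = S
    multiplications = 0
    for bit in d_bits[1:]:             # i = n-2 down to 0
        a = (a * a) % N
        if bit == 1:
            a = (S * a) % N            # executed only when d_i = 1
            multiplications += 1
    return a, multiplications

S, N = 37, 77
print(decrypt_square_and_multiply(S, N, [1, 0, 1, 1]))         # (4, 2):  S^11 mod 77
print(decrypt_square_and_multiply(S, N, [1, 0, 1, 0, 1, 1]))   # (9, 3):  S^43 mod 77
```

The number of conditional multiplications depends on the key bits, which is the
kind of key-dependent behavior that timing and power measurements can expose.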


Fault attacks on this algorithm can be detected either by using a residue code
or by calculating $M^e \bmod N$ and comparing the result to S. Even with either
of these detection techniques, the algorithm is vulnerable to a power
analysis-based attack because a step where $d_i = 0$ will consume less power
than a step for which $d_i = 1$. To counter such an attack, the algorithm can be
modified so that the power consumed in every step will be independent of $d_i$.
The modified algorithm shown in Figure 9.9 will, as expected, incur higher delay
and power penalties compared to the original algorithm. The check at the end of
the algorithm is intended to make the algorithm resistant to fault injections.

Decryption_Algorithm_2(S, N, (d_{n-1}, d_{n-2}, ..., d_0))
begin
    a = S
    for i from n - 2 to 0 do
        a = a^2 mod N
        b = S · a mod N
        if d_i = 1 then a = b else a = a
    end
    if (no error has been detected) then M = a
end

FIGURE 9.9 A modified decryption algorithm for RSA.

However, a careful examination of the algorithm in Figure 9.9 reveals that it
is still vulnerable to fault-based attacks. Since the result b of the
multiplication $S \cdot a \bmod N$ is not used if $d_i = 0$, the attacker can
inject a fault during this multiplication, and if the final result of the
decryption is correct, he can deduce one bit of the secret private key.
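This attack can be simulated with a small Python model of Figure 9.9. The
sketch makes several assumptions for illustration: the fault is modeled by the
hypothetical `fault_step` argument, which corrupts the product b in one chosen
loop iteration; the final check is done by re-encryption with the public key;
and the toy parameters (e, N) = (7, 77), d = 43, S = 37 are reused.

```python
def decrypt_alg2(S, N, e, d_bits, fault_step=None):
    """Figure 9.9 with an optional fault in the multiplication of step `fault_step`."""
    n = len(d_bits)
    a = S
    for pos, bit in enumerate(d_bits[1:]):
        i = n - 2 - pos                # loop index as in Figure 9.9
        a = (a * a) % N
        b = (S * a) % N
        if i == fault_step:
            b = (b + 1) % N            # hypothetical fault: corrupt the product
        a = b if bit == 1 else a       # b is discarded when d_i = 0
    # Final check by re-encryption: release M only if M^e mod N equals S.
    return a if pow(a, e, N) == S else None

def dummy_multiplication_attack(device, n):
    """Deduce d_i = 0 whenever a fault in step i leaves the output unblocked."""
    bits = [1]                         # d_{n-1} is assumed to be 1
    for i in range(n - 2, -1, -1):
        released = device(fault_step=i) is not None
        bits.append(0 if released else 1)
    return bits

S, N, e = 37, 77, 7
d_bits = [1, 0, 1, 0, 1, 1]            # d = 43
device = lambda fault_step: decrypt_alg2(S, N, e, d_bits, fault_step)
print(dummy_multiplication_attack(device, len(d_bits)))   # [1, 0, 1, 0, 1, 1]
```

Note that the attacker never sees an erroneous result here; the mere fact that
the output was released or blocked reveals the key bit.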

Fortunately, a different algorithm can be devised using what is called a
Montgomery ladder, as shown in Figure 9.10. In this algorithm, the intermediate
values of both a and b are used in the next step, and thus, a fault injected in
any intermediate step will yield an erroneous result which will be detected.


■ EXAMPLE 

Assume, as before, a 4-bit private key $(d_3, d_2, d_1, d_0) = (1011)$. The
algorithm in Figure 9.10 will calculate M as follows. For i = 3, $d_3 = 1$, and
thus, $a = S$ and $b = S^2$. For i = 2, $d_2 = 0$, and thus, $a = S^2$ and
$b = S^3$. For i = 1, $d_1 = 1$, and thus, $a = S^5$ and $b = S^6$. Finally, for
i = 0, $d_0 = 1$, resulting in $M = a = S^{11}$ and $b = S^{12}$. ■

The Montgomery-ladder-based decryption algorithm for RSA allows another
approach to detect faults injected during the decryption. The computed a and b
must be of the form (M, SM), and a fault injected during any intermediate step
will destroy this relationship. Thus, checking whether a and b satisfy this
relationship before providing the final result of the decryption can detect all
injected errors except those that modify either the bits of the secret private
key d or the number of times the loop in Figure 9.10 is performed. By using some
EDC for these two, in addition to verifying the relationship between a and b,
all injected faults can be detected.

Decryption_Algorithm_3(S, N, (d_{n-1}, d_{n-2}, ..., d_0))
begin
    a = 1
    b = S
    for i from n - 1 to 0 do
        if d_i = 0 then
            b = a · b mod N
            a = a^2 mod N
        end
        if d_i = 1 then
            a = a · b mod N
            b = b^2 mod N
        end
    end
    if (no error has been detected) then M = a
end

FIGURE 9.10 A Montgomery-ladder-based decryption algorithm for RSA.
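A Python sketch of the ladder with the final consistency check is shown below.
The `fault` argument is a hypothetical hook, a pair (step, delta) that adds
delta to a after the given loop step; it is our way of modeling an injected
fault and is not part of the algorithm. The toy parameters S = 37 and N = 77
are reused.

```python
def montgomery_ladder(S, N, d_bits, fault=None):
    """Figure 9.10; d_bits = [d_{n-1}, ..., d_0].

    Returns None (output blocked) when the relationship b = S*a mod N fails.
    """
    a, b = 1, S % N
    n = len(d_bits)
    for pos, bit in enumerate(d_bits):
        i = n - 1 - pos                       # loop index as in Figure 9.10
        if bit == 0:
            b = (a * b) % N
            a = (a * a) % N
        else:
            a = (a * b) % N
            b = (b * b) % N
        if fault is not None and fault[0] == i:
            a = (a + fault[1]) % N            # hypothetical injected fault
    if b != (S * a) % N:                      # (a, b) must be of the form (M, SM)
        return None
    return a

S, N = 37, 77
print(montgomery_ladder(S, N, [1, 0, 1, 1]))                        # 4  (S^11 mod 77)
print(montgomery_ladder(S, N, [1, 0, 1, 0, 1, 1]))                  # 9  (S^43 mod 77)
print(montgomery_ladder(S, N, [1, 0, 1, 0, 1, 1], fault=(2, 1)))    # None
```

As the text notes, this check does not cover faults in the key bits or in the
loop count, which would need their own protection.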

We next describe a fault-based attack on AES encryption that may succeed even
if a fault-detection mechanism that prevents erroneous results from being
output is incorporated into the design. The attack starts with providing an
all-zeroes input to the AES encryption device. In the very first step of the
encryption (see Figure 9.4), the initial round key is added, resulting in the
state matrix $s_{i,j} = 0 \oplus k_{i,j} = k_{i,j}$, where $0 \le i, j \le 3$.
At exactly this time instant, before the first SubBytes operation, the attacker
injects a fault into the $\ell$th bit ($\ell = 0, 1, \ldots, 7$) of a particular
byte $s_{i,j}$ of the state matrix so that the selected bit is set to 0. If the
corresponding bit of the key (bit $\ell$ of $k_{i,j}$) is 1, the output will be
incorrect and the detection mechanism will disallow this output. If, however,
the corresponding bit of the key is 0, no error will occur and the encryption
device will work properly, providing the attacker with the value of that bit of
the secret key.
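The logic of this attack can be illustrated with a deliberately simplified
Python model. Everything in it is an assumption made for illustration:
`toy_rest_of_cipher` is a SHA-256-based stand-in for the AES rounds that follow
the initial key addition (it is NOT AES), and the device's detection mechanism
is modeled by comparing the faulty result with the fault-free one.

```python
import hashlib

def toy_rest_of_cipher(state, key):
    """Stand-in for the AES rounds after the initial AddRoundKey (NOT AES)."""
    return hashlib.sha256(bytes(state) + bytes(key)).digest()[:16]

def device(key, fault=None):
    """Hypothetical protected device encrypting the all-zeroes block.

    fault = (byte_index, bit): force that state bit to 0 right after the
    initial key addition. The output is suppressed (None) if an error occurs.
    """
    state = [0 ^ k for k in key]                  # s = 0 XOR k = k
    good = toy_rest_of_cipher(state, key)
    if fault is not None:
        byte_index, bit = fault
        state[byte_index] &= ~(1 << bit) & 0xFF   # stuck-at-0 fault
    out = toy_rest_of_cipher(state, key)
    return out if out == good else None

def recover_key(dev, nbytes=16):
    key = []
    for j in range(nbytes):
        byte = 0
        for bit in range(8):
            if dev((j, bit)) is None:             # blocked: the fault changed the bit,
                byte |= 1 << bit                  # so the key bit was 1
        key.append(byte)
    return key

secret = list(range(0x10, 0x20))
print(recover_key(lambda f: device(secret, f)) == secret)   # True
```

In this idealized model, 128 perfectly placed and timed faults reveal a 128-bit
key; the practical difficulty lies in achieving that placement and timing.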

This attack is very simple to understand theoretically but may prove to be
quite difficult to mount due to the need for precise timing and location of the
injected fault. The secret key can still be extracted even if the strict timing
and location requirements of this attack are relaxed, but this may require a
larger number of fault injection experiments. The interested reader can find
further details in the original paper referenced in the Further Reading
section. The simple attack described above shows that implementations of
symmetric key ciphers, even those with fault detection capabilities, are not
completely immune to fault-based attacks.

9.3.4 Final Comment

A final remark is in order: the topic of this chapter is still a very active
area of research, and a constant stream of new fault-based attacks on
cryptographic devices, and of novel countermeasures to protect the devices
against these attacks, appears in the literature. The objective of this chapter
is to demonstrate the extra difficulties in devising fault-protection
techniques to deal with malicious faults injected into cryptographic devices.

9.4 Further Reading 

The official descriptions of the DES and AES algorithms appear in [24] and [25],
respectively. The AES example that is outlined in Section 9.1.1 is detailed in
[25]. A more detailed description of AES appears in [13]. The RSA algorithm was
first described in [27]. JavaScript AES, DES, and RSA calculators/demonstrators
showing intermediate values are available [29]. A considerable number of
articles on all aspects of cryptography are posted on the Website of the
International Association for Cryptologic Research [17]. Well-written
descriptions of key terms in cryptography appear in the online encyclopedia
Wikipedia [30].

Fault injection attacks were first discussed in [7]. Many other fault attacks
on public and symmetric key ciphers have been later presented in [1,2,9,12,14,
16,26,32]. A survey of various fault injection techniques is provided in [3],
which also reviews some protection schemes against such attacks. Detailed
descriptions of ways to protect ciphers from attacks appear in
[4,5,8,11,19-21,23,28]. The derivation of the parity bit prediction rules for
AES follows [4]. Simulators for error detection in several ciphers are
available online [22]. The insufficiency of fault detection schemes against
fault-based attacks on RSA and AES has been demonstrated in [8,31]. The
modified RSA decryption algorithm based on the Montgomery ladder is described
in [15,18]. New fault injection attacks and countermeasures appear in [10].

9.5 Exercises 

1. Construct an RSA encryption scheme using p = 61 and q = 53. Select the
public key e = 17, which is obviously relatively prime to $\varphi(pq)$. Find
the corresponding private key d, and for the message M = 123, calculate the
ciphertext and show that the private key allows the decryption of the
ciphertext.

2. Develop a software implementation of DES (or find one on the Internet) and
apply the fault-based attack shown in Table 9-2. Modify the program to inject
the faults, and write another program to find the secret key.

3. Complete the example (in the chapter) of injecting a fault into the private
key d of an RSA decryption device that uses the public key (e, N) = (7, 77) and
the private key (d, N) = (43, 77). Assume a ciphertext of 37 as in the example.
List all possible single-bit errors and all double-bit errors that can be
injected into d. For each error on your list find the erroneous plaintext that
the device will produce. Are all the erroneous plaintexts unique?

4. Develop a software implementation of RSA (or find one on the Internet), use
the prime numbers p = 7 and q = 11 as in the example in this chapter and select
e = 7. This yields the public key (e, N) = (7, 77) and the private key
(d, N) = (43, 77). Inject single-bit failures in your program, and obtain all
the bits of the private key.

5. Use the program and parameters from Problem 4 and add a residue check with
the modulus 3. Repeat the single-bit fault attacks. Will the modified program
detect all such faults?

6. Show that $x^8 \bmod g(x) = x^4 + x^3 + x + 1$ for the generator polynomial
of AES, $g(x) = x^8 + x^4 + x^3 + x + 1$.

7. Verify all 16 results of the MixColumns step that are shown in Figure 9.6f.

8. Inject a single-bit error in the state matrix shown in Figure 9.6c, replacing
the first byte 19 by 18, and calculate the erroneous state matrix at the end
of round 2. Compare your result to the matrix shown in Figure 9.6h. How
many bytes are in error?

9. Suppose you are using AES with data blocks and key of size 128 bits. Your
messages, however, are only 50 bits long. What would you put in the unused
78 bit positions?

10. Verify the correctness of the parity prediction equations for the
MixColumns step in AES.

Simulation Techniques


This chapter introduces the reader to statistical simulation approaches for
evaluating the reliability and associated attributes of fault-tolerant computer
systems.

Simulation is frequently used when analytical approaches are either not
feasible or not sufficiently accurate. Simulation, in general, has a deep
theoretical foundation in statistics that can take years to master, and to
which many books have been devoted. However, learning to write a basic
simulation program and to use the fundamental statistical tools for analyzing
the output data is much easier. These basic techniques are what we concentrate
on in this chapter. Having said that, this chapter is meant primarily for
readers with a reasonably strong understanding of probability theory.

We start by explaining how to write a simulation program. We then show how 
the output can be analyzed to deduce the system attributes. We then consider 
ways in which the results can be made more accurate by reducing the variance 
of the simulation output. We end the chapter by considering a different kind of 
simulation—fault injection, which is an experimental technique to characterize a 
system's response to faults. 

10.1 Writing a Simulation Program 

When faced with the need to construct a simulation model, one has three options: 

■ Write a program in a high-level general programming language, such as C, 
Java, or C++. 

■ Use a special-purpose simulation language such as SIMSCRIPT, GPSS, or 
SIMAN. 
■ Use or modify an available simulation package that has been designed to 
simulate such systems. Examples include SimpleScalar for computer architectures 
and OPNET for network simulation. 

In this section, we will focus on the first option. Readers wishing to follow one of 
the other approaches should consult the user's manual of the chosen language or 
package. 

The most common form of simulation programs is a discrete-event simulation 
in which the events of interest (changes in the state variables) occur at discrete 
instants of time. Most events of interest in fault-tolerant computing such as the 
arrival of jobs at a computer system, error occurrence, the failure of a processor, 
and its recovery or replacement, are discrete events. By contrast, the flow of water 
out of a leaky bucket is an example of a continuous-event system: the state variable 
(water level) is a continuous function of time at the macro level. Of course, if one 
were to consider it at an atomic level, this would become a discrete-event system 
as the molecules of water leak one by one out of the bucket. This is an example of 
a situation in which what is continuous at one level of granularity turns out to be 
discrete at a finer level. 

Let us illustrate the simulation process by an example, after which we will extract 
some general principles of the approach. 


■ EXAMPLE 


Suppose we wish to simulate the mean time to data loss (MTTDL) of a RAID 
Level 1 disk system. This system is so simple that good analytical models exist 
for its analysis and we do not really need a simulation model to obtain the 
MTTDLs. Still, this will be a good warm-up exercise in writing simulation 
programs. Also, the simulation can be used when the analytical model breaks 
down due to its limiting assumptions that do not always apply in practice 
(e.g., when the failures deviate significantly from a Poisson process). 

RAID Level 1 systems have been covered in Chapter 2: recall that the system 
consists of two mirrored disks, and that data loss occurs when the second disk 
fails before the first failed disk has been repaired. 

We start by identifying the events of interest to us: these are the failures and 
repair actions. Suppose failures occur as a Poisson process with rate $\lambda$, and repair 
time is a random variable $R$ with density function $f_R(\cdot)$. Assuming that 
the parameters of the failure process and repair time distributions are known 
to us, we can generate failure and repair times using a random number generator, 
as described later in Section 10.4. We show in Section 10.2 how the input 
parameters can be estimated if they are not given to us. 

The key data structure in the simulation is a linked list called the event 
chain, which holds the scheduled events (in this case, disk failure and repair 
instants) in temporal (meaning time) order. We also define a variable called 
the clock, which keeps the current simulated time and has an initial value of 0. 
The simulation consists of advancing the clock from one event to the next, 
recording statistics as we go. The flowchart for the simulation is shown in Figure 10.1. 
One point of detail is worth mentioning. Since the granularity of the 
time being measured is not infinitely fine (owing to the finite word length of 
the computer), it is possible, although highly improbable, that we will have 
two events, a disk failure and a repair completion (of the other disk), scheduled 
for the same instant in the event chain. In this case, we must decide in 
which order the events will be inserted in the event chain. For example, we 
may decide that the failure event goes in first and the repair completion next. 
Let us illustrate the operation of the algorithm in Figure 10.1. We begin by 
generating first-failure epochs for the two disks: suppose they happen at times 
28 and 95, respectively. At time 0, the system state is (Up, Up), representing the 
condition of the two disks. The event chain now is 

(28, d1, F) ↔ (95, d2, F) 

where the three elements in the 3-tuple indicate the epoch of the event, the disk 
in question (d1 or d2), and the event (F for failure and R for repair completion). 
The clock is now advanced to the next event in the event chain which occurs 
at time 28. The event is the failure of the first disk, and the system state now 
is (Down, Up). Generate a repair time for this disk: suppose the length of the 
generated repair time is 10, and the disk will be up again at time 38. Remove 
the event that we just processed from the event chain, and insert the repair 
event into the event chain: 

(38, d1, R) ↔ (95, d2, F) 

Advance to the next event in the chain, at time 38. At this point, the first disk 
is back up, so the state of the system is (Up, Up). Generate the next failure time 
of this disk: suppose this failure is 68 units into the future, which means that 
the failure will happen at time 38 + 68 = 106. The event chain is now 

(95, d2, F) ↔ (106, d1, F) 

Advance to the next event at time 95. The system state now is (Up, Down). 
Generate the repair time of this disk: suppose it is 14, so that this disk will 
come up at time 95 + 14 = 109. The event chain is now 

(106, d1, F) ↔ (109, d2, R) 

Advance to the next event, at time 106. The system state is now (Down, Down), 
representing data loss. For this simulation run, time to data loss (TTDL) is 106, 
and a new simulation run can begin. After all the runs are completed, the 
MTTDL of the system is estimated by calculating the average of the TTDLs of 
all the runs. If desired, a confidence interval for the MTTDL can be constructed 
as shown later in Section 10.2.5. ■ 
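
The event-chain procedure traced above can be sketched in a few lines of Python. This is only an illustrative sketch, not the flowchart of Figure 10.1: the function names `simulate_ttdl` and `estimate_mttdl` are hypothetical, a binary heap stands in for the linked list, and exponentially distributed repair times are an assumption made purely so that the second function has something to sample from.

```python
import heapq
import random

def simulate_ttdl(next_failure, next_repair):
    """Run one simulation and return the time to data loss (TTDL)."""
    up = [True, True]                      # state of disks d1 and d2
    # Event chain entries: (time, tie_break, disk, kind).
    # tie_break 0 (failure) sorts ahead of 1 (repair) when times coincide.
    chain = [(next_failure(), 0, disk, "F") for disk in (0, 1)]
    heapq.heapify(chain)
    while True:
        clock, _, disk, kind = heapq.heappop(chain)   # advance the clock
        if kind == "F":
            up[disk] = False
            if not any(up):                # both disks down: data loss
                return clock
            heapq.heappush(chain, (clock + next_repair(), 1, disk, "R"))
        else:
            up[disk] = True
            heapq.heappush(chain, (clock + next_failure(), 0, disk, "F"))

def estimate_mttdl(runs, failure_rate, repair_rate, seed=1):
    """Average the TTDL over independent runs (exponential repair assumed)."""
    rng = random.Random(seed)
    total = sum(
        simulate_ttdl(lambda: rng.expovariate(failure_rate),
                      lambda: rng.expovariate(repair_rate))
        for _ in range(runs)
    )
    return total / runs
```

Passing the samplers in as functions makes it possible to replay the trace of the example: feeding the failure intervals 28, 95, 68 and the repair times 10, 14 in that order reproduces the data loss at time 106.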


More complex simulations require more work, but the principle is the same. We 
create an event chain that is ordered temporally and advance from one event to the 
next, recording statistics appropriately. One has to be extremely careful to ensure 
that all events are captured in the event chain and that the simulation does not 
skip over any of them. 

The following are the key steps to follow when writing a simulation program: 

■ Thoroughly understand the system being simulated. Not doing so can 
result in a wrong system being modeled. 

■ List the events of interest. 

■ Determine the dependencies between events, if any. 

■ Understand the state transitions. 

■ Correctly estimate the distributions of the various input random variables. 

■ Identify the statistics to be gathered. 

■ Correctly analyze the output statistics to extract the required system 
attributes. 
10.2 Parameter Estimation 

To run a simulation program, the values of certain input parameters are needed, 
such as failure and repair rates. In addition, we need a way of analyzing the 
simulation output and extracting parameters such as reliability and mean time 
to system failure. In this section we will see how such parameter values can be 
estimated. We will distinguish between point estimation and interval estimation, 
describe three methods by which to obtain point estimates of parameter values, 
and show how a confidence interval for the parameter can be constructed. Most 
of our discussion assumes that we know the underlying shape of the distribution 
that the data will follow and that this shape depends on one or more parameters 
whose exact value is unknown to us. For example, we may believe that processors 
fail according to a Poisson process, which we can characterize by estimating the 
rate, $\lambda$, of this process. In some cases, we will estimate parameters even without 
knowledge of the exact shape of the distribution, using approximating formulas 
(most notably, the Central Limit Theorem). 


10.2.1 Point Versus Interval Estimation 

Suppose we are given a random variable $X$ with a known distribution function 
characterized by an unknown parameter $\theta$. To estimate $\theta$, we either sample or 
simulate $n$ independent observations of $X$, denoted by $X_1, \ldots, X_n$, and use a suitable 
function $T(X_1, \ldots, X_n)$ as an estimator of $\theta$. Since we will very likely not obtain 
the exact value of $\theta$, we denote the estimate by $\hat{\theta}$. Note that $\hat{\theta}$ is a random variable 
and will have a different value if a different sample $X_1, \ldots, X_n$ is selected. In 
what follows, we denote the expectation of a random variable $X$ by $E(X)$ and its 
variance by $\mathrm{Var}(X)$. Recall that the standard deviation of $X$ (commonly denoted 
by $\sigma(X)$) is the square root of the variance. We would like an estimator to be 
unbiased. 

Definition. An estimator $\hat{\theta} = T(X_1, \ldots, X_n)$ is called an unbiased estimator of a 
parameter $\theta$ if $E(\hat{\theta}) = E(T(X_1, \ldots, X_n)) = \theta$. 

Even if the estimator is unbiased, the likelihood that our point estimate is exactly 
equal to the real parameter is practically zero, although the difference between 
them is likely to diminish as n increases. We can characterize the confidence in 
our estimate by calculating an interval in which the parameter is expected to lie. 
This is interval estimation, and the resulting interval is called a confidence interval. 
The wider the interval, the greater is the likelihood that it includes the actual parameter, 
but the less informative it is. The next three sections discuss methods of 
obtaining point estimators, and Section 10.2.5 deals with constructing confidence 
intervals. 
10.2.2 Method of Moments 


Suppose we want to estimate the values of $m$ parameters of the probability distribution 
of some random variable, $X$. We define the $j$th distribution moment as $E(X^j)$ 
($j = 1, 2, \ldots$). We then sample or simulate $n$ independent observations of $X$, namely, 
$X_1, \ldots, X_n$, and define the $j$th sample moment, $M_j$, as 

$$M_j = \frac{1}{n} \sum_{i=1}^{n} X_i^j$$

We now equate the first $m$ distribution moments with the first $m$ sample moments: 

$$E(X^j) = M_j \qquad (j = 1, \ldots, m)$$

The left-hand sides include the m parameters as unknowns, and so we have m 
equations, the solution of which yields estimators of these parameters. 

Let us consider some examples. 


■ EXAMPLE 

Suppose we believe that the running time, $X$, of a task has a normal distribution 
with two parameters $\mu$ and $\sigma^2$ whose values we do not know. We execute 
the task $n$ times and record the running times $X_1, \ldots, X_n$. Since $\mu = E(X)$ and 
$\sigma^2 = \mathrm{Var}(X) = E[(X - \mu)^2] = E(X^2) - (E(X))^2$, we can use the Method of Moments 
to write the two equations for our estimates, $\hat{\mu}$ and $\hat{\sigma}^2$, of the mean and 
variance, respectively: 

$$\hat{\mu} = \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

and 

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} X_i^2}{n} - \bar{X}^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}$$

Although $\bar{X}$ is an unbiased estimate of $\mu$, $\hat{\sigma}^2$ is not an unbiased estimate of $\sigma^2$. 
As shown in almost any basic book on statistics, a small correction will result 
in an unbiased estimator for $\sigma^2$: 

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \tag{10.1}$$

When $n$ is large (as it is in most engineering experiments), there is no significant 
numerical difference between dividing by $n$ or by $n - 1$. ■ 
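
The bias of the divide-by-$n$ estimator is easy to see empirically when $n$ is small. The sketch below is an illustration only: the function name, the sample size, and the choice of a normal distribution with $\sigma = 2$ are all assumptions. It averages both variance estimates over many small samples.

```python
import random

def average_variance_estimates(n, trials, sigma=2.0, seed=7):
    """Return the mean of the divide-by-n and divide-by-(n-1) variance estimates."""
    rng = random.Random(seed)
    sum_biased = sum_unbiased = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(xs) / n
        ss = sum((x - mean) ** 2 for x in xs)   # sum of squared deviations
        sum_biased += ss / n                    # Method of Moments estimate
        sum_unbiased += ss / (n - 1)            # corrected estimate (Equation 10.1)
    return sum_biased / trials, sum_unbiased / trials
```

With $n = 5$ and a true variance of 4, the first average settles near $4(n-1)/n = 3.2$, while the second settles near 4.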
■ EXAMPLE 

Suppose we know that the lifetime, $X$, of a processor is exponentially distributed, 
but do not know the value of the parameter, $\lambda$, of that distribution. The 
density function for the processor lifetime is 

$$f(x) = \lambda e^{-\lambda x}, \qquad x \ge 0$$

We have one unknown and therefore need just one equation. We start with $n$ 
processors and run them until they all fail. Let $X_i$ be the lifetime of processor $i$. 
Then, our estimate of the first moment of the processor lifetime (its mean 
value) is the sample average $\bar{X}$. Since $E(X) = 1/\lambda$, we end up with the equation 

$$\frac{1}{\hat{\lambda}} = \bar{X}$$

and therefore, 

$$\hat{\lambda} = \frac{1}{\bar{X}}$$

Although $\bar{X}$ is an unbiased estimator of $1/\lambda$, $1/\bar{X}$ is not an unbiased estimator 
of $\lambda$. Still, it is often a good estimate. ■ 


■ EXAMPLE 

Suppose, instead, that $X$ follows a Weibull distribution. Recall that $X$ has the 
density function 

$$f(x) = \lambda \beta x^{\beta - 1} e^{-\lambda x^{\beta}} \qquad (x \ge 0) \tag{10.2}$$

The two parameters of this distribution are $\lambda$ and $\beta$, so we need two equations 
to solve for these two unknowns. We obtain these equations by writing out 
expressions for the first two moments, $E(X)$ and $E(X^2)$: 

$$E(X) = \lambda^{-1/\beta}\, \Gamma(1 + 1/\beta)$$

$$E(X^2) = \lambda^{-2/\beta}\, \Gamma(1 + 2/\beta)$$

where $\Gamma(y) = \int_0^{\infty} u^{y-1} e^{-u}\, du$ is the Gamma function (see Section 2.2). We can 
therefore write 

$$\hat{\lambda}^{-1/\hat{\beta}}\, \Gamma(1 + 1/\hat{\beta}) = \bar{X}$$

$$\hat{\lambda}^{-2/\hat{\beta}}\, \Gamma(1 + 2/\hat{\beta}) = \frac{\sum_{i=1}^{n} X_i^2}{n}$$

We have two equations in the two unknowns $\hat{\lambda}$ and $\hat{\beta}$, which we can solve to 
obtain the estimates $\hat{\lambda}$ and $\hat{\beta}$. ■ 
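
The two Weibull moment equations have no closed-form solution, but they are easy to solve numerically. One possible approach, sketched below under stated assumptions, divides the second equation by the square of the first so that $\lambda$ cancels, solves the remaining equation in $\beta$ by bisection, and then recovers $\lambda$ from the first equation. The function name, the search bracket [0.1, 50], and the use of bisection are choices made for this illustration; the sketch also relies on the ratio $\Gamma(1 + 2/\beta)/\Gamma(1 + 1/\beta)^2$ decreasing as $\beta$ grows.

```python
import math
import random

def weibull_moments_fit(samples):
    """Method-of-moments estimates (lam, beta) for f(x) = lam*beta*x^(beta-1)*exp(-lam*x^beta)."""
    n = len(samples)
    m1 = sum(samples) / n                    # first sample moment
    m2 = sum(x * x for x in samples) / n     # second sample moment
    target = m2 / (m1 * m1)                  # lam cancels in this ratio

    def ratio(beta):
        return math.gamma(1 + 2 / beta) / math.gamma(1 + 1 / beta) ** 2

    lo, hi = 0.1, 50.0                       # assumed bracket for beta
    for _ in range(100):                     # bisection on a decreasing function
        mid = (lo + hi) / 2
        if ratio(mid) > target:
            lo = mid
        else:
            hi = mid
    beta = (lo + hi) / 2
    lam = (math.gamma(1 + 1 / beta) / m1) ** beta   # from the first moment equation
    return lam, beta
```

To try it out, note that Python's `random.weibullvariate(alpha, beta)` uses a scale parameter `alpha`, which corresponds to $\lambda = \alpha^{-\beta}$ in the parameterization of Equation 10.2.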


The method of moments is a fairly simple approach which often works reasonably 
well, although, as we have seen, it does not always result in unbiased 
estimators. Still, we can generalize and say that the sample average $\bar{X}$ is always 
used as an estimate for the expected value $E(X)$. 


10.2.3 Method of Maximum Likelihood 

The maximum likelihood method determines parameter values for which the 
given observations would have the highest probability. Given a set of observations, 
we set up a likelihood function, which expresses how likely it is that we obtain the 
observed values of the random variable, as a function of the parameter values. We 
then find those values of the parameters for which this function is maximized. 


■ EXAMPLE 

We believe that the intervals between failures of a certain system are exponentially 
distributed, with parameter $\lambda$. Further, these intervals are independent 
of one another. From experimental observation of the system we obtain the 
following five values for the interfailure intervals: 10, 5, 11, 12, 15. 

The joint density function of these five observations is the product of the 
individual densities, since the observations were made independently of one another. 
This joint density, conditioned on the parameter being $\lambda$, is the likelihood function, 
$L(\lambda)$: 

$$L(\lambda) = \lambda e^{-10\lambda} \cdot \lambda e^{-5\lambda} \cdot \lambda e^{-11\lambda} \cdot \lambda e^{-12\lambda} \cdot \lambda e^{-15\lambda} = \lambda^5 e^{-53\lambda}$$

We now seek that value of $\lambda$ which will maximize $L(\lambda)$. We can do this using 
basic calculus: 

$$\frac{dL(\lambda)}{d\lambda} = (5\lambda^4 - 53\lambda^5)\, e^{-53\lambda} = 0$$

Solving for $\lambda$ yields $\lambda = 0$ or $\lambda = 5/53$. 

Clearly, $\lambda = 0$ is a minimum while $\lambda = 5/53$ is a maximum. Hence, our estimate 
of $\lambda$ based on this set of observations is $\hat{\lambda} = 5/53$. (Note that this is 
equal to the Method of Moments estimate for the same parameter, which is 
$\hat{\lambda} = 1/\bar{X} = 1/(53/5) = 5/53$.) ■ 
■ EXAMPLE 

Suppose now that we believe that the interfailure times are distributed according 
to the Weibull distribution, which has the probability density function 
shown in Equation 10.2, and we have to estimate the two parameters $\lambda$ and $\beta$, 
using the same five observations as in the previous example. 

The likelihood function is now given by 

$$L(\lambda, \beta) = f(10) \cdot f(5) \cdot f(11) \cdot f(12) \cdot f(15)$$

$$= \lambda^5 \beta^5\, 10^{\beta-1}\, 5^{\beta-1}\, 11^{\beta-1}\, 12^{\beta-1}\, 15^{\beta-1}\, e^{-\lambda(10^{\beta} + 5^{\beta} + 11^{\beta} + 12^{\beta} + 15^{\beta})}$$

When attempting to maximize a function like this, it is easier to proceed by 
maximizing $\ln L(\lambda, \beta)$ rather than $L(\lambda, \beta)$ itself. Since $\ln(x)$ is a monotonically 
increasing function of $x$, this will lead to the same values for $\hat{\lambda}, \hat{\beta}$. Now, 

$$\ln L(\lambda, \beta) = 5 \ln \lambda + 5 \ln \beta + (\beta - 1) \ln(99000) - \lambda(10^{\beta} + 5^{\beta} + 11^{\beta} + 12^{\beta} + 15^{\beta})$$

$$= 5 \ln \lambda + 5 \ln \beta + 11.5(\beta - 1) - \lambda(10^{\beta} + 5^{\beta} + 11^{\beta} + 12^{\beta} + 15^{\beta})$$

To find $\hat{\lambda}, \hat{\beta}$, we differentiate the log-likelihood with respect to $\lambda$ and $\beta$ and 
equate the derivatives to zero: 

$$\frac{\partial \ln L(\lambda, \beta)}{\partial \lambda} = 0, \qquad \frac{\partial \ln L(\lambda, \beta)}{\partial \beta} = 0$$

This yields the equations 

$$5\hat{\lambda}^{-1} = 10^{\hat{\beta}} + 5^{\hat{\beta}} + 11^{\hat{\beta}} + 12^{\hat{\beta}} + 15^{\hat{\beta}}$$

$$5\hat{\beta}^{-1} + 11.5 = \hat{\lambda}\left(10^{\hat{\beta}} \ln(10) + 5^{\hat{\beta}} \ln(5) + 11^{\hat{\beta}} \ln(11) + 12^{\hat{\beta}} \ln(12) + 15^{\hat{\beta}} \ln(15)\right)$$

These equations can now be solved to obtain the values of $\hat{\lambda}$ and $\hat{\beta}$. ■ 
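
One way to solve the two likelihood equations numerically is sketched below. The first equation gives $\hat{\lambda} = n / \sum_i x_i^{\hat{\beta}}$; substituting this into the second leaves a single equation in $\beta$, which the sketch solves by bisection. The function name and the bracket [0.01, 50] are assumptions made for this illustration, and the bisection relies on the left-minus-right difference of the substituted equation decreasing in $\beta$.

```python
import math

def weibull_mle(xs):
    """Maximum likelihood estimates (lam, beta) for the Weibull density of Equation 10.2."""
    n = len(xs)
    sum_log = sum(math.log(x) for x in xs)

    def score(beta):
        # Second likelihood equation after substituting lam = n / sum(x^beta).
        powers = [x ** beta for x in xs]
        weighted = sum(p * math.log(x) for p, x in zip(powers, xs))
        return n / beta + sum_log - n * weighted / sum(powers)

    lo, hi = 0.01, 50.0                  # assumed bracket for beta
    for _ in range(200):                 # bisection: score is positive below the root
        mid = (lo + hi) / 2
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    beta = (lo + hi) / 2
    lam = n / sum(x ** beta for x in xs) # first likelihood equation
    return lam, beta

lam_hat, beta_hat = weibull_mle([10, 5, 11, 12, 15])
```

Substituting the returned pair back into both equations is a quick check that the solver has converged.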


We now turn to the issue of experiments which are concluded before they are 
truly complete. For instance, suppose we are conducting experiments to obtain 
processor lifetime data. We may have a certain time-limit to our experiment: at 
that point, we terminate data collection even if not all the processors under test 
have failed yet. When using such experiments to estimate parameter values, we 
have to take into account the premature termination of the experiment. We do this 
by multiplying the joint density of the completed observations by the probability 
that the non-failed units have lifetimes exceeding the experimental time-limit. 


■ EXAMPLE 

We carry out experiments to estimate the lifetime of a processor. We believe 
that the processor lifetime (measured in hours) follows an exponential distribution, 
with parameter $\mu$ whose value we are seeking to estimate. The density 
function for the processor lifetime is 

$$f(x) = \mu e^{-\mu x}, \qquad x \ge 0$$

and the cumulative probability distribution function is 

$$F(x) = 1 - e^{-\mu x}$$

We start with a total of 10 processors and impose a time limit of 1000 hours 
on our experiment. That is, our experiment will end when 1000 hours have 
elapsed or all the processors have failed (whichever occurs sooner). 

Suppose our observations are that four failures occurred before the experiment 
was terminated, at times 700, 800, 900, and 950 hours. The remaining six processors 
have lifetimes exceeding 1000 hours. 

The likelihood function for the whole sample is given by 

$$L(\mu) = f(700)\, f(800)\, f(900)\, f(950)\, (1 - F(1000))^6$$

$$= \mu^4 e^{-\mu(700 + 800 + 900 + 950)}\, e^{-6000\mu}$$

$$= \mu^4 e^{-9350\mu}$$

We find $\hat{\mu}$ that maximizes the likelihood function by taking the derivative of 
$L$ and equating it to zero, 

$$\frac{dL(\mu)}{d\mu} = (4\mu^3 - 9350\mu^4)\, e^{-9350\mu} = 0$$

which results in $\mu = 0$ or $\mu = 4.3 \times 10^{-4}$. 

The maximum likelihood estimate is therefore $\hat{\mu} = 4.3 \times 10^{-4}$. ■ 


If we terminate the experiment prematurely, we lose information and the qual¬ 
ity of the estimate is likely to suffer. This is shown in the following example. 
■ EXAMPLE 

Consider again the previous example, except that we decide to set the time-limit 
of our experiment at some relatively small $T$, say $T = 500$ hours. Based 
on the measurements in the previous example, no failures will have occurred 
over this interval. Applying the maximum likelihood method, we seek the 
value of $\mu$ which maximizes 

$$L(\mu) = (1 - F(T))^{10} = (e^{-\mu T})^{10} = e^{-10\mu T}$$

The maximum likelihood estimate resulting from our experiment is $\hat{\mu} = 0$, 
which translates to a prediction that the processor lifetimes are infinite. This 
result is, of course, ludicrous; however, it is the best that we can extract from 
the maximum likelihood approach and the observation that no failures have 
occurred. ■ 
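
For exponential lifetimes, the likelihoods in the preceding examples all have the form $\mu^r e^{-\mu S}$, where $r$ is the number of observed failures and $S$ is the total time accumulated by all units, failed or not. Setting the derivative to zero gives $\hat{\mu} = r/S$. The short sketch below applies this formula; the function name and argument names are hypothetical.

```python
def exponential_mle(failure_times, censoring_times=()):
    """MLE of the exponential rate from observed failures and units still running.

    failure_times:   lifetimes of units that failed during the experiment
    censoring_times: running times of units that had not failed when it ended
    """
    total_time = sum(failure_times) + sum(censoring_times)
    return len(failure_times) / total_time

complete = exponential_mle([10, 5, 11, 12, 15])                  # 5/53
truncated = exponential_mle([700, 800, 900, 950], [1000] * 6)    # 4/9350
no_failures = exponential_mle([], [500] * 10)                    # 0.0
```

The three calls correspond to the complete experiment, the experiment stopped at 1000 hours, and the experiment stopped at 500 hours with no failures observed.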


The maximum likelihood approach can also be used when the data are not observed 
exactly but are only known to lie in some interval. Once again, this is probably 
best explained through an example. 


■ EXAMPLE 

Similarly to the previous examples, we have 10 processors whose lifetime of 
$X$ days is exponentially distributed with an unknown parameter $\mu$. The units 
operate in some remote location, and we can only check on their status at 
11 AM every day. We observe the first failure on the 50th day, the second on 
the 120th day, and the third on the 200th day, at which point the experiment 
concludes. 

When we observe a failure at 11 AM on day $i$, it means that the lifetime of 
the processor was greater than $i - 1$ days but less than $i$ days. The probability 
of such a failure is therefore equal to 

$$q_i = F(i) - F(i - 1) = e^{-(i-1)\mu} - e^{-i\mu}$$

The likelihood function associated with our observations is then given by 

$$L(\mu) = q_{50}\, q_{120}\, q_{200}\, (e^{-200\mu})^7$$

We can now find the value of $\mu$ which maximizes this likelihood function. ■ 
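
The maximization in this example can be carried out numerically. The sketch below works with the logarithm of the likelihood and uses a simple ternary search; both the search method and the bracket $[10^{-6}, 0.1]$ are assumptions of this illustration, and the search presumes that the log-likelihood has a single peak inside the bracket.

```python
import math

def log_likelihood(mu, failure_days=(50, 120, 200), survivors=7, end_day=200):
    """ln L(mu) for failures detected at daily checks plus units still running."""
    ll = -survivors * end_day * mu                 # ln of (e^{-200 mu})^7
    for i in failure_days:
        # ln q_i, where q_i = e^{-(i-1) mu} - e^{-i mu}
        ll += math.log(math.exp(-(i - 1) * mu) - math.exp(-i * mu))
    return ll

def maximize(f, lo, hi, iterations=200):
    """Ternary search for the maximizer of a single-peaked function on [lo, hi]."""
    for _ in range(iterations):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if f(a) < f(b):
            lo = a
        else:
            hi = b
    return (lo + hi) / 2

mu_hat = maximize(log_likelihood, 1e-6, 0.1)       # about 0.0017 per day
```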


The greater these sampling intervals, the worse our estimate is likely to be. Indeed, 
if the time-intervals are too coarse, the maximum likelihood method will 
make ridiculous predictions. Consider the following modification to our previous 
example. 


■ EXAMPLE 

Consider a situation in which the processors are checked every T days, for 
some large T (say T = 300). Suppose we find, on the very first check, that all 
10 processors have failed: this means that all 10 have had lifetimes less than T 
days. 

The likelihood function associated with this observation is 

$$L(\mu) = (F(T))^{10} = (1 - e^{-\mu T})^{10}$$

The value of $\mu$ that maximizes this function is $\hat{\mu} = \infty$; our estimate is thus that 
the average processor lifetime is zero! What this means is that $T$ was set so 
high that we were not able to obtain much information from checking after $T$ 
days. ■ 


10.2.4 The Bayesian Approach to Parameter 
Estimation 

The Bayesian approach relies on Bayes's formula for reversing conditional probability, 
and it works as follows. We start with some prior knowledge of the parameter 
we are estimating, expressed through a probability or density function of 
the parameter values. We then collect experimental or observational data of the 
random variable, and construct a posterior probability or density of the parameter 
based on both our prior knowledge and the observations. The parameter estimate 
is the expected value of this posterior probability. 


■ EXAMPLE 

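We believe that a processor fails according to a Poisson process with rate $\lambda$, 
which is the parameter we wish to estimate. Suppose we know that $\lambda$ is somewhere 
in the range $[10^{-4}, 2 \times 10^{-4}]$, and we express this knowledge by considering 
$\lambda$ to be a random variable uniformly distributed over that range. Thus, 

$$f_{\text{prior}}(\lambda) = \begin{cases} 10^{4} & \text{if } 10^{-4} \le \lambda \le 2 \times 10^{-4} \\ 0 & \text{otherwise} \end{cases}$$

The current estimate for $\lambda$ is its expected value, $\hat{\lambda} = 1.5 \times 10^{-4}$. 

Suppose now that we run the processor for $\tau$ hours without observing a failure. 
The posterior density of $\lambda$, which incorporates the information gleaned 
from this experiment, is as follows: 

$$f_{\text{posterior}}(\lambda) = f_{\text{prior}}(\lambda \mid \text{lifetime} \ge \tau)$$

$$= \frac{\text{Prob}\{\text{Lifetime} \ge \tau \mid \text{Failure rate} = \lambda\}\, f_{\text{prior}}(\lambda)}{\int_{\ell = 10^{-4}}^{2 \times 10^{-4}} \text{Prob}\{\text{Lifetime} \ge \tau \mid \text{Failure rate} = \ell\}\, f_{\text{prior}}(\ell)\, d\ell}$$

$$= \frac{e^{-\lambda \tau}\, f_{\text{prior}}(\lambda)}{\int_{\ell = 10^{-4}}^{2 \times 10^{-4}} e^{-\ell \tau}\, f_{\text{prior}}(\ell)\, d\ell}$$

$$= \begin{cases} \dfrac{10^{4}\, e^{-\lambda \tau}}{10^{4} \int_{\ell = 10^{-4}}^{2 \times 10^{-4}} e^{-\ell \tau}\, d\ell} & \text{if } \lambda \in [10^{-4}, 2 \times 10^{-4}] \\ 0 & \text{otherwise} \end{cases}$$

$$= \begin{cases} \dfrac{\tau e^{-\lambda \tau}}{e^{-0.0001\tau} - e^{-0.0002\tau}} & \text{if } 10^{-4} \le \lambda \le 2 \times 10^{-4} \\ 0 & \text{otherwise} \end{cases}$$

The estimate of $\lambda$ is now given by the expected value of this new density: 

$$\hat{\lambda} = \int_{\lambda = 10^{-4}}^{2 \times 10^{-4}} \lambda\, f_{\text{posterior}}(\lambda)\, d\lambda = \frac{(1 + 0.0001\tau)\, e^{-0.0001\tau} - (1 + 0.0002\tau)\, e^{-0.0002\tau}}{\tau \left(e^{-0.0001\tau} - e^{-0.0002\tau}\right)}$$

Figure 10.2 plots the estimate of $\lambda$ based on observed values of $\tau$. Note that as 
$\tau$ increases, $\hat{\lambda}$ tends to the lower bound of the [0.0001, 0.0002] interval; it can 
never go outside this interval, however. ■ 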
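
The posterior mean can also be evaluated by direct numerical integration, which avoids the loss of precision that the closed-form expression suffers for very small $\tau$. The sketch below is an illustration under stated assumptions: the function name, the midpoint rule, and the number of integration steps are all choices made here. It weights each candidate rate in the prior interval by the probability $e^{-\lambda\tau}$ of surviving $\tau$ hours.

```python
import math

def posterior_mean(tau, lo=1e-4, hi=2e-4, steps=10000):
    """Posterior mean of the failure rate after tau failure-free hours (uniform prior)."""
    h = (hi - lo) / steps
    num = den = 0.0
    for k in range(steps):
        lam = lo + (k + 0.5) * h        # midpoint of the k-th subinterval
        w = math.exp(-lam * tau)        # likelihood of surviving tau hours
        num += lam * w
        den += w
    return num / den                    # the constant prior density cancels
```

Evaluating `posterior_mean` at a few values of $\tau$ shows the behavior described in the example: the estimate starts at the midpoint of the prior interval for $\tau = 0$ and drifts toward the lower bound as $\tau$ grows.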


The Bayesian approach is controversial because it depends on the existence of 
prior information about the parameter being estimated. In some cases, this information 
may not be difficult to derive. For instance, if we are asked to evaluate 
an unknown coin, we can assume that the probability of getting a "head" is uniformly 
distributed over the entire possible range of [0,1]. In other cases, it may not 
be possible to express prior information with any confidence. 

Note also that if the prior density is zero over any given parameter interval, 
it will remain zero for that interval no matter what the experimental results 
are. In our earlier example, we started with a prior density that was zero outside 
the interval $[10^{-4}, 2 \times 10^{-4}]$. Since the posterior densities are constructed by 
multiplying this prior density by some additional terms, all posterior densities 
will also be zero outside this interval. When the prior density is zero over 
some interval $I$, it means that we already know that the parameter cannot fall in 
that interval. Since this knowledge is assumed to be correct, no amount of posterior 
information can result in the probability of falling in $I$ being anything but 
zero. 
FIGURE 10.2 Estimate of $\lambda$ based on observed $\tau$. 


10.2.5 Confidence Intervals 


A confidence interval with confidence level $1 - \alpha$ for an unknown parameter $\theta$ is an 
interval $[a, b]$ calculated as a function of a sample of size $n$, $X_1, \ldots, X_n$, in such a way 
that if we calculate similar intervals based on a large number of samples of size $n$, 
a fraction $1 - \alpha$ of these intervals will actually include the real parameter $\theta$. 
$1 - \alpha$ is usually selected to be 0.95 or 0.99, also expressed as 95% or 99%. 

The most common use of confidence intervals in engineering applications is 
that of calculating a confidence interval for the expectation, $\mu$, of some random 
variable, and this is discussed next. Our treatment rests on a fundamental result 
of probability theory: the Central Limit Theorem. We state it here without proof. 


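Central Limit Theorem. Suppose $X_1, X_2, \ldots, X_n$ are independent and identically distributed 
random variables with mean $\mu$ and standard deviation $\sigma$. Consider the average 
of these variables, $\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$. In the limit, as $n \to \infty$, $\bar{X}$ approaches the normal 
distribution, with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$: this means that for a large $n$ 

$$F_{\bar{X}}(x) = \text{Prob}\{\bar{X} \le x\} \approx \frac{1}{\sqrt{2\pi}\, \sigma/\sqrt{n}} \int_{-\infty}^{x} \exp\left[-\frac{1}{2}\left(\frac{y - \mu}{\sigma/\sqrt{n}}\right)^{2}\right] dy$$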
Stated slightly differently, for a large sample size $n$ 

$$\text{Prob}\left\{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z\right\} \approx \Phi(z) \tag{10.3}$$

where 

$$\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-y^{2}/2}\, dy$$

is the probability distribution function of a standard normal random variable (with 
mean 0 and standard deviation 1). We should stress that this is an approximate 
result; it gets more exact as $n \to \infty$. 

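Let us now define $Z_p$ to be the number for which $\Phi(Z_p) = p$. Then, we have from 
Expression 10.3 that, in the limit as $n \to \infty$, 

$$\text{Prob}\left\{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le Z_{1-\alpha/2}\right\} = 1 - \frac{\alpha}{2}$$

and 

$$\text{Prob}\left\{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} > Z_{1-\alpha/2}\right\} = \frac{\alpha}{2}$$

Since the standard normal density is symmetric about $z = 0$, 

$$\text{Prob}\left\{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < -Z_{1-\alpha/2}\right\} = \frac{\alpha}{2}$$

and therefore, 

$$\text{Prob}\left\{-Z_{1-\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le Z_{1-\alpha/2}\right\} = 1 - \alpha$$

or stated differently, 

$$\text{Prob}\left\{\bar{X} - \frac{\sigma}{\sqrt{n}} Z_{1-\alpha/2} \le \mu \le \bar{X} + \frac{\sigma}{\sqrt{n}} Z_{1-\alpha/2}\right\} = 1 - \alpha \tag{10.4}$$

The interval 

$$[a, b] = \left[\bar{X} - \frac{\sigma}{\sqrt{n}} Z_{1-\alpha/2},\; \bar{X} + \frac{\sigma}{\sqrt{n}} Z_{1-\alpha/2}\right] \tag{10.5}$$
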

is called a $1 - \alpha$ confidence interval; $1 - \alpha$ is called the confidence level of the interval. 
So long as the experiment has not yet been conducted and $\bar{X}$ remains a random 
variable, there is a probability of $1 - \alpha$ that the true mean, $\mu$, will be included in 
the interval. Once we have calculated $\bar{X}$ (based on simulation or experimentation), 
it is no longer a random variable; it is a fixed number. Since $\mu$ is also a fixed number, 
it is either inside or outside the calculated confidence interval. The level of 
confidence $1 - \alpha$ is therefore not the probability that the true mean lies within the 
calculated interval; it is rather the confidence we have in the method of calculation 
that was used to generate the interval: it is successful in a fraction $1 - \alpha$ of the cases. This is 
a subtle technical point, which does not affect how we use confidence intervals. 
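
The statement that the method "is successful in a fraction $1 - \alpha$ of the cases" can be checked by simulation. The sketch below is illustrative only: the function name, the parameter values, and the use of normally distributed observations with a known $\sigma$ are assumptions. It repeatedly draws a sample, builds the interval of Equation 10.5, and counts how often the interval contains the true mean.

```python
import math
import random
from statistics import NormalDist

def coverage(trials=2000, n=50, mu=37.0, sigma=5.0, alpha=0.05, seed=11):
    """Fraction of simulated confidence intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)       # Z_{1 - alpha/2}
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / trials
```

With $\alpha = 0.05$, the returned fraction should be close to 0.95, up to the sampling noise of the experiment itself.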


■ EXAMPLE 


Suppose we wish to estimate the mean lifetime (in months), $\mu$, of a device, 
by constructing for it a 95% confidence interval. In a sample of $n = 50$ such 
devices, we obtained an average lifetime of $\bar{X} = 37$ months, with a standard 
deviation of $\sigma = 5$ months. Looking up a table of the standard normal distribution, 
we find that $Z_{0.975} = 1.96$. Hence, the 95% confidence interval for $\mu$ 
is 

$$[a, b] = \left[37 - 1.96 \cdot \frac{5}{\sqrt{50}},\; 37 + 1.96 \cdot \frac{5}{\sqrt{50}}\right] = [35.61, 38.39]$$

We now say with a confidence of 95% that the expected lifetime of a device of 
the type analyzed is between 35.6 months and 38.4 months. ■ 


■ EXAMPLE 

Suppose the confidence interval obtained in the previous example is too wide 
for our requirements; we need a 95% interval that is not wider than 1 month. 
Since we have no control over $\sigma$ and $Z_{1-\alpha/2}$, the only way to make the interval 
narrower is by increasing the sample size $n$. We require that 

$$2 \cdot Z_{1-\alpha/2}\, \frac{\sigma}{\sqrt{n}} \le 1$$

or 

$$2 \cdot 1.96 \cdot \frac{5}{\sqrt{n}} \le 1$$

which results in 

$$n \ge (2 \cdot 1.96 \cdot 5)^{2} = 384.16$$

We therefore need a sample of at least 385 devices in order to obtain the required 
accuracy in estimating $\mu$. ■ 
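
The two calculations above are straightforward to automate. The sketch below uses `NormalDist().inv_cdf` from Python's standard `statistics` module in place of a printed normal table; the function names are hypothetical, and $\sigma$ is treated as known, as in the examples.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(xbar, sigma, n, level=0.95):
    """Confidence interval for the mean, following Equation 10.5."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)    # Z_{1 - alpha/2}
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

def sample_size_for_width(sigma, width, level=0.95):
    """Smallest n for which the interval is no wider than the requested width."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return math.ceil((2 * z * sigma / width) ** 2)

low, high = mean_confidence_interval(37, 5, 50)
needed = sample_size_for_width(5, 1)
```

Because `inv_cdf` returns a more precise quantile than the rounded table value 1.96, results can differ from hand calculations in the last digits.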
■ EXAMPLE 

A given system either fails during the day or it does not. We want to estimate 
the probability $p$ that it does fail, using a 99% confidence interval. To estimate $p$ 
based on $n$ experiments or simulation runs (where each experiment represents 
one day), we define 

$$X_i = \begin{cases} 1 & \text{if the system fails in experiment } i \\ 0 & \text{otherwise} \end{cases}$$

Since $E(X) = p$, our estimate of $p$ is 

$$\hat{p} = \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

$\hat{p}$ is actually the fraction of days in the sample on which the system failed. 
To get a confidence interval for $p$, note that $\mathrm{Var}(X) = p(1 - p)$ and $\sigma(X) = \sqrt{p(1 - p)}$. 
Relying once more on the Central Limit Theorem, and using $\hat{p}$ instead 
of the unknown $p$, we obtain the approximate confidence interval for $p$ 
at confidence level $1 - \alpha$ 

$$[a, b] = \left[\hat{p} - Z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}},\; \hat{p} + Z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right]$$

Suppose we conducted $n = 200$ experiments out of which the system failed 
in 12 cases, resulting in $\hat{p} = 0.06$. From tables of the normal distribution we 
can determine that $Z_{0.995} = 2.57$. Our 99% confidence interval is therefore the 
interval 

$$\left[0.06 - 2.57 \sqrt{\frac{0.06 \times 0.94}{200}},\; 0.06 + 2.57 \sqrt{\frac{0.06 \times 0.94}{200}}\right] = [0.017, 0.103]$$

We can say with a confidence of 99% that the failure probability is somewhere 
between 1.7% and 10.3%. 

The last interval has a width of 0.086 and is clearly not informative enough 
for most applications. To get a more accurate result we need to increase $n$. 
Say, for example, that we require the width of the confidence interval to be 
no larger than 0.002 (which implies that the estimate will be within 0.001, 
that is, 0.1 percentage points, of the real failure probability, with a confidence of 99%). What should 
the number of experiments (or simulation runs) be? Based on our "pilot study" 
we have $\hat{p} = 0.06$, and therefore 

$$2 \times 2.57\, \frac{\sqrt{0.06 \cdot 0.94}}{\sqrt{n}} \le 0.002$$

which results in 

$$n \ge \frac{4 \cdot 2.57^{2} \cdot 0.06 \cdot 0.94}{0.002^{2}} = 3.7 \times 10^{5}$$

In most instances, it will be impractical to conduct so many experiments. ■ 
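
The same two steps, an interval from a pilot study followed by a sample-size calculation, can be sketched for a failure probability as follows. The function names are hypothetical, and the normal approximation from the Central Limit Theorem is used exactly as in the example above.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(failures, n, level=0.99):
    """Approximate confidence interval for a failure probability."""
    p_hat = failures / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)    # Z_{1 - alpha/2}
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def runs_for_width(p_hat, width, level=0.99):
    """Number of runs needed so that the interval is no wider than `width`."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return math.ceil(4 * z * z * p_hat * (1 - p_hat) / width ** 2)

low, high = proportion_confidence_interval(12, 200)
runs = runs_for_width(0.06, 0.002)
```

The required number of runs grows with the inverse square of the target width, which is why halving the width quadruples the simulation effort.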


The last example has highlighted a major problem in high-reliability systems: 
in most cases, we will need a substantial amount of data to validate statistically 
the high reliability of the system. Suppose we are trying to validate by experiment 
that the true failure probability, $p$, of a life-critical system is $10^{-8}$. For such a low 
failure probability to be validated, we need a very high level of confidence indeed, 
say 99.999999% (or even higher), requiring a truly astronomical volume of data. 
We explore this matter further in the Exercises. 

10.3 Variance Reduction Methods 

As is evident from Equation 10.5, the length of a confidence interval is inversely 
proportional to $\sqrt{n}$, where $n$ is the number of simulation runs or experiments, and 
proportional to the standard deviation of the random variable under study. Note 
that the standard deviation that is used in calculating the confidence interval is 
itself in practice an estimate obtained from the simulation data and may therefore 
vary slightly with n. The brute-force way to shrink the confidence interval of an 
estimate is obviously to increase n. However, in the interest of efficiency, we should 
also consider the option of somehow reducing the variance (and, consequently, the 
standard deviation) of the estimate. In this section, we consider several schemes 
for doing so. 

The first two approaches rely on the following facts from elementary statistics:

$$E(X + Y) = E(X) + E(Y) \quad \text{and} \quad \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$$

where $\mathrm{Cov}(X, Y) = E([X - E(X)][Y - E(Y)])$ is called the covariance of X and Y.

10.3.1 Antithetic Variables 

Suppose we run simulations to estimate some parameter (for example, Mean Time
to Data Loss [MTDL] in a RAID system). In traditional simulation, we would run
n independent simulations and use the results. If $Z_1$, $Z_2$ are the outputs from two
independent runs, we can expect that

$$\mathrm{Cov}(Z_1, Z_2) = 0$$

so that

$$\mathrm{Var}\left(\frac{Z_1 + Z_2}{2}\right) = \frac{\mathrm{Var}(Z_1) + \mathrm{Var}(Z_2)}{4}$$


When the method of antithetic variables is used, we try to run simulations in
pairs, coupled together in such a way that their results (any parameter that is
estimated by the simulation, be it reliability, waiting time, etc.) are negatively
correlated, and then treat $Y = (Z_1 + Z_2)/2$ as the output from this pair of runs. If the
simulation pair produces the outputs $Z_1$, $Z_2$ such that $\mathrm{Cov}(Z_1, Z_2) < 0$, the
variance of Y will be smaller than it would be if the two runs were independent and
not coupled.

A good way to couple pairs of simulation runs is to couple the random
variables used by them. Suppose the output of the simulation is a monotonic
function of the random variables and the first run of the pair uses uniform random
variables $U_1, U_2, \ldots, U_n$; then the second run can use $1 - U_1, 1 - U_2, \ldots, 1 - U_n$.
The corresponding random variables in the two sequences are negatively
correlated: if $U_i$ is large, $1 - U_i$ is small, and vice versa. This applies even when the
distributions of the random variables used in the simulation are not uniform.
We are assuming that in order to generate such random variables, we will
ultimately need to call uniform random number generators (URNGs), described
later in Section 10.4.1. We can apply the coupling on the output of these URNGs.
For example, if we need to generate exponentially distributed random variables
by using $X = -(1/\mu)\ln U$, the coupled simulations will generate U and then use
$X_1 = -(1/\mu)\ln U$ and $X_2 = -(1/\mu)\ln(1 - U)$, respectively (see Section 10.4.3).

In particular, if we can write the simulation output as being a monotone
function of the uniform random variables used, then it is possible to show that the
simulation outputs will indeed be negatively correlated when the method of
antithetic variables is used. Showing this is outside the scope of this book; see the
Further Reading section for details on where to find the proof.
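The coupling of exponential random variables described above can be tried out directly. The sketch below is illustrative only (the function names and the seed are assumptions, not part of the text): it draws pairs from U and 1 - U and reports their sample covariance, which should come out negative.

```python
import math
import random

def antithetic_exponential_pairs(mu, n, seed=0):
    """Return n pairs (X1, X2) with X1 = -ln(U)/mu and X2 = -ln(1-U)/mu."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        u = rng.random()
        while u == 0.0:          # log(0) is undefined; redraw in that rare case
            u = rng.random()
        pairs.append((-math.log(u) / mu, -math.log(1.0 - u) / mu))
    return pairs

def sample_covariance(pairs):
    """Unbiased sample covariance of the two coordinates."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)

pairs = antithetic_exponential_pairs(mu=1.0, n=10000)
print("sample covariance of the coupled pair:", sample_covariance(pairs))
```

Each member of a pair is still exponentially distributed with parameter mu on its own; only the dependence between the two members changes.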


■ EXAMPLE 


Consider a structure composed of k elements. Denote by $S_i$ the state of
component i: a functional component is denoted by $S_i = 1$, whereas if it is down
we have $S_i = 0$. A structure function $\phi(S_1, S_2, \ldots, S_k)$ is an indicator function
(assumes the values 0, 1), which expresses the dependence of the functionality of
the system on the functionality of its components: it is equal to 1 if the system
is functional for $S_1, \ldots, S_k$ and to 0 if it is not.

For instance, if the system consists of k elements connected in series, we
have

$$\phi(S_1, S_2, \ldots, S_k) = S_1 \times S_2 \times \cdots \times S_k$$

If it is a triplex system with a perfect voter and $S_i$ denotes the state of the ith
processor, then

$$\phi(S_1, S_2, S_3) = \begin{cases} 1 & \text{if } S_1 + S_2 + S_3 \ge 2 \\ 0 & \text{otherwise.} \end{cases}$$


Now suppose we want to simulate the reliability R, for some given length of
time t, of a system with a very complex structure function that cannot easily be
analyzed. Using traditional methods, we would run a simulation by generating
random variables that would determine whether individual components
were up or not, and then determine whether the overall system was functional
during [0, t]. Using antithetic variables, we will run the simulations in pairs,
with the random variables coupled as described above. If $Y_i$ is the average of
the values of the structure function from the two simulation runs in pair i, and
we run a total of 2n simulations (or n pairs), then the estimated reliability of
the system is

$$\hat{R} = \frac{Y_1 + Y_2 + \cdots + Y_n}{n}$$

Furthermore, the variance of the estimate is likely to be far lower than would
be obtained if we ran 2n independent simulations.

It is important to note that the $Y_i$s are independent of one another. That is,
although each run consists of paired simulations, there is no coupling between
one pair and another. This allows us to use traditional statistical analysis on
the $Y_i$s. ■


By how much can we expect the variance of the estimate to drop? This depends 
on the covariance of the two outputs in each pair of runs. In the Exercises, you are 
invited to determine the usefulness of this approach in a variety of cases. 
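As a small experiment along these lines, the sketch below applies antithetic pairs to the triplex structure function. It assumes, hypothetically, that each processor fails independently at exponential rate lam with no repair, so a processor is up at time t exactly when its uniform draw is below exp(-lam*t); the names and the seed are illustrative.

```python
import math
import random

def triplex(states):
    """Structure function of a triplex with a perfect voter."""
    return 1 if sum(states) >= 2 else 0

def component_states(uniforms, lam, t):
    # lifetime = -ln(u)/lam exceeds t exactly when u < exp(-lam*t)
    threshold = math.exp(-lam * t)
    return [1 if u < threshold else 0 for u in uniforms]

def antithetic_reliability(lam, t, n_pairs, seed=1):
    """Estimate R(t) from n_pairs coupled pairs; also return the estimated variance of the mean."""
    rng = random.Random(seed)
    ys = []
    for _ in range(n_pairs):
        us = [rng.random() for _ in range(3)]
        z1 = triplex(component_states(us, lam, t))
        z2 = triplex(component_states([1.0 - u for u in us], lam, t))
        ys.append((z1 + z2) / 2.0)          # Y_i, the output of pair i
    mean = sum(ys) / n_pairs
    var_of_mean = sum((y - mean) ** 2 for y in ys) / (n_pairs - 1) / n_pairs
    return mean, var_of_mean

def exact_triplex_reliability(lam, t):
    p = math.exp(-lam * t)                  # single-processor reliability
    return 3 * p**2 - 2 * p**3

estimate, variance = antithetic_reliability(lam=1.0, t=0.5, n_pairs=20000)
print(estimate, exact_triplex_reliability(1.0, 0.5), variance)
```

Since the pairs are independent of one another, the usual sample-variance formula applies to the list of pair averages, as noted in the example above.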

10.3.2 Using Control Variables 

When simulating to estimate the mean value E(X) of a random variable X, select
some other random variable, Y, whose expectation is known or can be calculated
precisely to be $\theta_Y$. Consider the random variable

$$Z = X + k(Y - \theta_Y)$$

Z has the properties

$$E(Z) = E(X),$$

$$\mathrm{Var}(Z) = \mathrm{Var}(X) + k^2\,\mathrm{Var}(Y) + 2k\,\mathrm{Cov}(X, Y)$$

Hence, if we can pick k suitably, we can exploit any correlation between X and Y
to reduce the variance of the estimate of E(Z) (note that E(X) = E(Z)), and then use
simulation to estimate E(Z) rather than E(X). Because $\mathrm{Var}(Z) \le \mathrm{Var}(X)$ for such a k, this will
result in a narrower confidence interval. Y is called the control variable or control
variate.

It is easy to show that Var(Z) is minimized when

$$k = -\frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(Y)}$$

For this value of k,

$$\mathrm{Var}(Z) = \mathrm{Var}(X) - \frac{(\mathrm{Cov}(X, Y))^2}{\mathrm{Var}(Y)}$$


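If Cov(X, Y) and Var(Y) are not known in advance, we can estimate them by
running n simulations (for some initial small n), generating $X_i$, $Y_i$ for $i = 1, \ldots, n$ and
using the following estimates:

$$\widehat{\mathrm{Cov}}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

and

$$\widehat{\mathrm{Var}}(Y) = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$$

where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$.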
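The procedure can be sketched on a toy problem that is not taken from the text: estimating E(X) for X = exp(U), with U uniform on [0, 1], using Y = U as the control variable (its mean 0.5 is known exactly). The pilot size, seed, and function name are illustrative assumptions.

```python
import math
import random
import statistics

def control_variate_estimate(n_pilot, n_main, seed=2):
    rng = random.Random(seed)
    theta_y = 0.5                                   # known mean of the control variable

    # Pilot runs: estimate Cov(X, Y) and Var(Y), then k = -Cov/Var.
    us = [rng.random() for _ in range(n_pilot)]
    xs = [math.exp(u) for u in us]
    x_bar, y_bar = statistics.mean(xs), statistics.mean(us)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, us)) / (n_pilot - 1)
    k = -cov / statistics.variance(us)

    # Main runs: record both the plain output X and the controlled output Z.
    plain, controlled = [], []
    for _ in range(n_main):
        u = rng.random()
        x = math.exp(u)
        plain.append(x)
        controlled.append(x + k * (u - theta_y))
    return (k, statistics.mean(controlled),
            statistics.variance(controlled), statistics.variance(plain))

k, z_mean, z_var, x_var = control_variate_estimate(200, 5000)
print(f"k = {k:.3f}, estimate = {z_mean:.4f}, exact = {math.e - 1:.4f}")
print(f"Var(Z) = {z_var:.5f}  versus  Var(X) = {x_var:.5f}")
```

Estimating k from a separate pilot run keeps the main-run estimate unbiased, because k is then independent of the samples it is applied to.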


■ EXAMPLE 

We are interested in estimating the reliability (at time t) of a complex system
that uses processor redundancy without repair. We can use as control variable 
the number of processors that are up at that time. ■ 


10.3.3 Stratified Sampling 

The method of stratified sampling is probably best introduced through an 
example. 


■ EXAMPLE 

A computer system runs daily from 9 AM to 5 PM and is available for repair
only after 5 PM. We wish to simulate the system and estimate the probability,
$\pi$, that the system survives through a randomly selected day. Because the
failure rates of the processors are different on weekdays and on weekends due to
different utilizations, the system has two different survival probabilities: $\pi_1$
on a weekday and $\pi_2$ on a weekend day.

The conventional way to do a simulation experiment is the following: for
each run, first select the day at random (weekday with probability $p_1 = 5/7$,
weekend with probability $p_2 = 2/7$), apply the appropriate failure rate for that
type of day, and then simulate the behavior of the system over that day. If
it fails during run i, set $X_i = 0$; if it survives, set $X_i = 1$. Make n runs for a
sufficiently large n and then estimate the survival probability as
$\hat{\pi} = (X_1 + X_2 + \cdots + X_n)/n$.

A better approach, which uses the method of stratified sampling, is to carry
out two sets of runs. Set 1 consists of $n_1$ runs in which the system is simulated
under weekday conditions (with the failure rates set appropriately), and set 2
consists of $n_2$ runs (where $n_1 + n_2 = n$) with the failure rates set according to
weekend conditions. Then, if the survival probability estimated from set i is $\hat{\pi}_i$
(i = 1, 2), the overall survival probability is estimated as

$$\hat{\pi} = (5/7)\hat{\pi}_1 + (2/7)\hat{\pi}_2$$


Denoting

$$V_1 = \mathrm{Var}(X \mid \text{Weekday}) = \pi_1(1 - \pi_1)$$

and

$$V_2 = \mathrm{Var}(X \mid \text{Weekend}) = \pi_2(1 - \pi_2)$$

we obtain

$$\mathrm{Var}(\hat{\pi}) = \frac{(5/7)^2 V_1}{n_1} + \frac{(2/7)^2 V_2}{n_2}$$

We claim that this second approach can be expected to yield estimates with a
smaller variance if $n_1$ and $n_2$ are chosen appropriately. There are two ways of
choosing $n_1$ and $n_2$:

■ The most straightforward way is to set $n_i = n p_i$.

■ A better approach is to use a pilot simulation to obtain a rough estimate
of $V_1$ and $V_2$, and select $n_i$ to minimize the variance of the estimate under
the constraint $n_1 + n_2 = n$.
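The two allocation rules can be compared numerically. Minimizing the variance expression above under the constraint (for instance with a Lagrange multiplier) gives $n_i$ proportional to $p_i\sqrt{V_i}$. The sketch below uses hypothetical survival probabilities of 0.90 on weekdays and 0.99 on weekends, which are not from the text, and treats the $n_i$ as real numbers rather than rounding them.

```python
import math

def stratified_variance(p, v, n_alloc):
    """Var of the stratified estimate: sum of p_i^2 * V_i / n_i."""
    return sum(pi**2 * vi / ni for pi, vi, ni in zip(p, v, n_alloc))

def proportional_allocation(p, n):
    return [n * pi for pi in p]

def optimal_allocation(p, v, n):
    """n_i proportional to p_i * sqrt(V_i), which minimizes the variance for fixed n."""
    weights = [pi * math.sqrt(vi) for pi, vi in zip(p, v)]
    total = sum(weights)
    return [n * w / total for w in weights]

def unstratified_variance(p, pis, n):
    """Var of the conventional estimate, where the day type is sampled too."""
    overall = sum(pi * s for pi, s in zip(p, pis))
    return overall * (1.0 - overall) / n

p = [5 / 7, 2 / 7]
pis = [0.90, 0.99]                       # hypothetical pi_1, pi_2
v = [s * (1.0 - s) for s in pis]         # V_1, V_2
n = 1000
print("conventional :", unstratified_variance(p, pis, n))
print("proportional :", stratified_variance(p, v, proportional_allocation(p, n)))
print("optimal      :", stratified_variance(p, v, optimal_allocation(p, v, n)))
```

In practice $V_1$ and $V_2$ are unknown, which is why the second rule needs the pilot simulation mentioned above.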


In general, suppose we are running a simulation to estimate the mean value, E(X),
of some random variable X, and that this mean value depends on some
parameter, $Q \in \{q_1, q_2, \ldots, q_\ell\}$. Suppose we can accurately calculate $p_i = \mathrm{Prob}\{Q = q_i\}$,
$i = 1, 2, \ldots, \ell$.

Using the stratified sampling approach, we first run $n_i$ simulations to estimate
E(X) conditioned on the event $\{Q = q_i\}$, for every $i = 1, \ldots, \ell$. Then, we estimate
E(X) by applying the Total Probability formula. That is,

$$E(X) = E[E(X \mid Q)] = E(X \mid Q = q_1)p_1 + E(X \mid Q = q_2)p_2 + \cdots + E(X \mid Q = q_\ell)p_\ell$$

The effectiveness of the stratified sampling approach is based on the identity that
you are invited to prove in the Exercises:

$$\mathrm{Var}(X) = E[\mathrm{Var}(X \mid Q)] + \mathrm{Var}[E(X \mid Q)]$$

The actual amount of variance reduction will depend on the extent of the
correlation between X and Q. In effect, we are using our knowledge of $\mathrm{Prob}\{Q = q_i\}$
to reduce the variance, since Q itself does not need to be simulated any more and
the variability introduced by simulating it is eliminated.


10.3.4 Importance Sampling 

In the importance sampling approach to simulation, we simulate a modified
system in which the chance of failure has been artificially boosted and then correct for
that boost. A detailed development of the theory is beyond the scope of this book:
we have limited ourselves to providing just an introduction to it. There are three
reasons for this.

■ Importance sampling is a temperamental technique. If not carefully used, it 
can end up actually increasing the variance of the simulation estimate. 

■ It is not yet a mature technique. It is, rather, the focus of much current 
research. 

■ It is more mathematically complicated than anything else encountered in 
this book. 

The importance sampling approach is based on the following reasoning. Suppose
we want to estimate by simulation some parameter $\theta = E[\phi(X)]$ where $\phi(\cdot)$ is some
function and X is a random variable with probability density function f(x).

Assume that g(x) is a probability density function with the property that g(x) > 0
for all x for which f(x) > 0. Then,

$$E[\phi(X)] = \int \phi(x) f(x)\,dx = \int \frac{\phi(x) f(x)}{g(x)}\, g(x)\,dx = \int \psi(x) g(x)\,dx \qquad (10.6)$$

where $\psi(x) = \phi(x) f(x)/g(x)$. Now, $\int \psi(x) g(x)\,dx$ is equal to $E[\psi(Y)]$, where Y is a
random variable with probability density function $g(\cdot)$. This suggests that we estimate
$E[\psi(Y)]$ rather than $E[\phi(X)]$ (although both are equal to $\theta$).

More precisely, the standard approach to estimating $\theta = E(\phi(X))$ would be to
obtain a sample of X, namely, $X_1, X_2, \ldots, X_n$, and estimate $\theta$ as

$$\hat{\theta} = \overline{\phi(X)} = \frac{1}{n} \sum_{i=1}^{n} \phi(X_i)$$

The importance sampling approach is to obtain a sample of Y (with density
function g(y)), denoted by $Y_1, Y_2, \ldots, Y_n$, and then estimate $\theta$ as

$$\hat{\theta} = \overline{\psi(Y)} = \frac{1}{n} \sum_{i=1}^{n} \psi(Y_i)$$

For this method to be beneficial, it is necessary that

$$\mathrm{Var}(\psi(Y)) < \mathrm{Var}(\phi(X))$$

This will happen if we select some g(x) with the property that f(x)/g(x) is small
whenever $\phi(x)$ is large and vice versa. The choice of g(x) is crucial to the reduction
of variance: an incorrect choice can render the method of importance sampling
counterproductive by actually increasing the variance.


■ EXAMPLE 


Consider two random variables A and B, each exponentially distributed with
parameter $\mu$. That is, their density functions are each of the form $f(x) = \mu e^{-\mu x}$
for $x \ge 0$. Then, suppose we want to use simulation to estimate the parameter
$\theta = \mathrm{Prob}\{A + B > 100\}$. Assume that $\mu \gg 1/50$, so that it is unlikely that A + B > 100
(and $\theta$ is therefore very small).

We could obviously solve this problem analytically, without any need for
simulation. However, let us use it as a vehicle to explain how the principles of
importance sampling could be used here.

Using the conventional approach, we would generate two samples of size
n for A and B: $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$, respectively. Define

$$\phi(a_i, b_i) = \begin{cases} 1 & \text{if } a_i + b_i > 100 \\ 0 & \text{otherwise} \end{cases}$$

Because $\theta = E(\phi(A, B))$, we can estimate

$$\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} \phi(a_i, b_i)$$

As we saw in Section 10.2.5, we will need a very large number of observations
to accurately estimate a very small value of $\theta$. In the importance sampling
approach, we change the density function so that larger values of A and B are
more likely. In particular, let us use the density function $g(x) = \gamma e^{-\gamma x}$ for some
$\gamma \ll \mu$. Using this density function, we generate values of A and B denoted by
$a'_1, a'_2, \ldots, a'_n$ and $b'_1, b'_2, \ldots, b'_n$. We then use the estimate

$$\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} \phi(a'_i, b'_i)\, \frac{f(a'_i) f(b'_i)}{g(a'_i) g(b'_i)} = \frac{1}{n} \sum_{i=1}^{n} \phi(a'_i, b'_i) \left(\frac{\mu}{\gamma}\right)^2 e^{-(\mu - \gamma)(a'_i + b'_i)}$$

It now remains for us to obtain a suitable value of $\gamma$ to reduce the variance
of the estimate. Denoting the ith term of the above sum by $S_i$, we note that if
$a'_i + b'_i \le 100$, $S_i = 0$. Also, if $a'_i + b'_i > 100$, then

$$S_i = \left(\frac{\mu}{\gamma}\right)^2 e^{-(\mu - \gamma)(a'_i + b'_i)} \le \left(\frac{\mu}{\gamma}\right)^2 e^{-100(\mu - \gamma)}$$

Selecting $\gamma$ to minimize

$$\left(\frac{\mu}{\gamma}\right)^2 e^{-100(\mu - \gamma)}$$

will minimize the maximum possible value of $S_i$ and thereby reduce the
variance of $S_i$. Simple calculus shows that $\gamma = 0.02$ minimizes the above quantity.
Thus, the importance sampling approach to this problem is as follows:

■ Generate $a'_i$ and $b'_i$ according to the density function $g(x) = 0.02e^{-0.02x}$, for
$i = 1, 2, \ldots, n$.

■ Define $\phi(a'_i, b'_i) = 1$ if $a'_i + b'_i > 100$ and 0 otherwise.

■ Estimate $\theta$ by

$$\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} \phi(a'_i, b'_i) \left(\frac{\mu}{0.02}\right)^2 e^{-(\mu - 0.02)(a'_i + b'_i)}$$
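The three steps above can be sketched as follows. The value mu = 0.2 is a hypothetical choice satisfying $\mu \gg 1/50$, and the exact answer used for comparison comes from the fact that the sum of two independent exponentials with parameter $\mu$ has tail probability $e^{-\mu x}(1 + \mu x)$; function names and the seed are illustrative.

```python
import math
import random

def exact_theta(mu, threshold=100.0):
    """Prob{A + B > threshold} for independent exponentials with parameter mu."""
    return math.exp(-mu * threshold) * (1.0 + mu * threshold)

def importance_sampling_estimate(mu, gamma, n, threshold=100.0, seed=3):
    """Sample A, B from density g(x) = gamma*exp(-gamma*x) and reweight by f/g."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        a = rng.expovariate(gamma)       # expovariate takes the rate parameter
        b = rng.expovariate(gamma)
        if a + b > threshold:            # phi = 1 only on the rare event
            total += (mu / gamma) ** 2 * math.exp(-(mu - gamma) * (a + b))
    return total / n

mu = 0.2
estimate = importance_sampling_estimate(mu, gamma=0.02, n=200000)
print(f"importance sampling: {estimate:.3e}   exact: {exact_theta(mu):.3e}")
```

With these values the event has probability of roughly 4 in $10^8$, so a conventional simulation of the same size would most likely never observe it, whereas under g about 40% of the samples fall in the region of interest.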


Simulating Continuous-Time Markov Chains: 

Mean Time Between System Failures 

Suppose the system we are analyzing can be described by a Markov chain (see
Chapter 2) with continuous time t, also called a CTMC (continuous-time Markov
chain). Let $\lambda_{ij}$ be the rate of transition from state i to state j; then $\lambda_i = \sum_{j \ne i} \lambda_{ij}$
is the total rate of departure from state i. The sojourn time of the system in each
state (the time it stays in a state before leaving it) is exponentially distributed with
parameter $\lambda_i$ for state i.

Now, suppose that all the transitions in the chain are either component failure or
repair transitions. A subset of the states, those in which the system is considered to
have failed, are called system-failure states.

FIGURE 10.3 A continuous-time Markov chain.


■ EXAMPLE 

Consider a system of three processors that can fail and be repaired, and
suppose the system behaves according to the Markov chain depicted in
Figure 10.3. The state is the number of processors that are functional. The
failure transitions are 3 → 2, 2 → 1, 2 → 0, and 1 → 0. The repair transitions are
2 → 3, 1 → 2, and 0 → 1. The rates of transition are as shown on the arrow
labels. From this, we can write

$$\lambda_3 = \lambda_{32}$$
$$\lambda_2 = \lambda_{21} + \lambda_{20} + \lambda_{23}$$
$$\lambda_1 = \lambda_{10} + \lambda_{12}$$
$$\lambda_0 = \lambda_{01}$$

Suppose the system is operational as long as at least one processor is
operational; then the set of system-failure states is {0}. ■


Going back to the general failure-repair Markov chain, we are interested in find¬ 
ing the mean time between system failures (MTBF). Because repair is usually much 
faster than time between component failures, the chain makes a large number of 
transitions before it enters one of the system-failure states, and thus the simulation 
will have to run for a very long time to measure the time until the system fails. We 
can use importance sampling to speed up the simulation as follows. 

Let us define state N as the initial state with all components functional, and let
t = 0 be the time at which the simulation starts. By definition, there are no repair
transitions out of state N; there can only be failure transitions. Let F be the set
of system-failure states. Since we are considering systems with repair, there will
be one or more repair transitions out of each state with any failed components.
Ultimately, the system will return to state N. Let this time of return be $\tau_R$, the
system regeneration time. (At this point, the system is as good as new.) Let $\tau_F$ be the
time until the system first enters a system-failure state. Then, you are invited in the
Exercises to show that

$$E[\tau_F] = \frac{E[\min(\tau_R, \tau_F)]}{\mathrm{Prob}\{\tau_F < \tau_R\}} \qquad (10.7)$$

In most systems, where repair rates are much greater than failure rates, we can
expect that $E[\min(\tau_R, \tau_F)]$ will be only slightly smaller than $E(\tau_R)$, since the system
can be expected to return to state N many times before it enters a system-failure
state. We can expect the system to return to state N fairly quickly. So, traditional
simulation can be used to estimate $E[\min(\tau_R, \tau_F)]$: just calculate the average length
of time it takes the system to return from state N to state N.

Estimating $\theta = \mathrm{Prob}\{\tau_F < \tau_R\}$, on the other hand, should be done using
importance sampling because $\tau_F < \tau_R$ is the rare event in which the system fails before
returning to state N. Notice that we no longer need to keep track, in our
simulations, of the time it takes to make the transitions, or of how long $\tau_F$ or $\tau_R$ may
be; all we need to record is the fraction of times that $\tau_F < \tau_R$. This means that we
do not need to change the sojourn time of the system in any of its states, just the
transition probabilities.

The technique we will follow to implement importance sampling is called
balanced failure biasing. Before presenting it, we have to introduce some notation. Each
transition out of any state represents either a failure or a repair event. In state N,
since everything is functional, there can only be failure events. Conversely, in a
state in which everything is down, there can only be repair events. Let $n_F(i)$ be the
number of failure transitions (the number of outgoing transitions denoting
component failure events) out of state i.

Since we are not interested in finding out the amount of time the system spends
in each state, we need only simulate a discrete-time Markov chain (DTMC)
embedded into the continuous-time chain. This is a DTMC that studies just the progress
of the system from one state to the next, without recording the sojourn time in each
state.

Suppose we have a CTMC that has the following events: It starts from state
N, moves to state $i_1$ at time $t_1$, to state $i_2$ at time $t_2$, etc. The sample path for the
corresponding embedded discrete-time Markov chain will be N, then $i_1$, then $i_2$,
etc.

We now define a probability transition function for the DTMC, $p_{ij}$, which is the
probability that the system moves to state j given that it was in state i. It can be
shown that

$$p_{ij} = \frac{\lambda_{ij}}{\lambda_i}$$

Intuitively, the probability that the system will transit from state i to state j is the
rate of going from i to j as a fraction of the total rate of leaving state i.

Define by $p_R(i)$ the probability of making a repair transition out of state i. Now,
pick some p* (usually 0.2 to 0.4 works well) and define a new DTMC characterized
by transition probabilities $\tilde{p}_{ij}$ defined as follows:

■ Case 1. i = N

$$\tilde{p}_{ij} = \begin{cases} \dfrac{1}{n_F(i)} & \text{if } i \to j \text{ is a failure transition and } p_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}$$

■ Case 2. i is neither N nor a system-failure state and $p_R(i) > 0$

$$\tilde{p}_{ij} = \begin{cases} \dfrac{p^*}{n_F(i)} & \text{if } i \to j \text{ is a failure transition and } p_{ij} > 0 \\ \dfrac{(1 - p^*)\,p_{ij}}{p_R(i)} & \text{if } i \to j \text{ is a repair transition and } p_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}$$

■ Case 3. i is not a system-failure state but $p_R(i) = 0$

$$\tilde{p}_{ij} = \begin{cases} \dfrac{1}{n_F(i)} & \text{if } p_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}$$

■ Case 4. i is a system-failure state

$$\tilde{p}_{ij} = p_{ij}$$

We have only modified transition probabilities out of states that are not system- 
failure states. For these, we have done the following: 

■ The total probability of making a failure transition is now p*. 

■ This probability is equally divided among all the failure transitions. 

We now perform n simulation runs of the modified system, recording for each
the likelihood ratio of the sample path (where the sample path is the sequence of
states that are visited). The likelihood ratio for simulation run k, $L_k$, is defined as

$$L_k = \frac{\text{Probability of the original DTMC having this sample path}}{\text{Probability of the modified DTMC having this sample path}}$$

Let

$$I_k = \begin{cases} 1 & \text{if simulation run } k \text{ ends with system failure} \\ 0 & \text{if simulation run } k \text{ ends with the system back in state } N \end{cases}$$

Then, we estimate $\theta = \mathrm{Prob}\{\tau_F < \tau_R\}$ by

$$\hat{\theta} = \frac{\sum_{k=1}^{n} I_k L_k}{n}$$

Let us now relate this to Equation 10.6. The transition probabilities that we use
to simulate the system (the $\tilde{p}_{ij}$ values) are a realization of g(x). $L_k$ is a realization of
f(x)/g(x). Finally, $I_k$ is a realization of $\phi(x)$. Because failure is a discrete event, we
replace the integral in Equation 10.6 by a sum.


■ EXAMPLE 

Consider the system shown in Figure 10.4a; its embedded DTMC is shown
in Figure 10.4b. The labels for the CTMC arrows are the transition rates, and
those for the embedded DTMC arrows the transition probabilities. By
definition, the transition probabilities out of each state must add up to 1. (In a
general DTMC, it is permissible for a state to transit to itself; this will never
happen here since each transition represents either a failure or a repair event.)
State 0 is the only system-failure state.

FIGURE 10.4 A continuous-time Markov chain (CTMC) and its embedded discrete-time
Markov chain (DTMC): (a) the CTMC, (b) embedded discrete-time Markov chain,
(c) modified discrete-time Markov chain. Solid lines indicate failure transitions;
dashed lines indicate repair transitions.

Now, suppose we select p* = 0.3. Consider the transitions out of each state,
one by one.

■ State 3. There is only one transition out of this state, to state 2. We therefore
have $\tilde{p}_{32} = 1$.

■ State 2. There is one repair transition out of state 2 and $n_F(2) = 2$ failure
transitions. Each of these failure transitions will have probability p*/2 =
0.15 of happening; the single repair transition will happen with probability
1 - p* = 0.7.

■ State 1. There is one repair transition and one failure transition out of this
state: $n_F(1) = 1$, the failure transition will happen with probability p* = 0.3,
and the repair transition with probability 1 - p* = 0.7.

■ State 0. This is a system-failure state: there is no change to the transition
probabilities out of this state.

Figure 10.4c depicts the modified DTMC. We will now simulate this chain
to estimate $\mathrm{Prob}\{\tau_F < \tau_R\}$ under the new transition probabilities. Suppose we
decide to make a total of three simulation runs and average them to find an
estimate for this probability. (In reality, one would carry out perhaps thousands
or even millions of such simulation runs, but we are just illustrating the
technique here.) We will simulate the system starting from state 3. The simulation
will end when the system returns to state 3 (in which case, we have $\tau_F > \tau_R$)
or enters the system-failure state 0 (in which case, $\tau_F < \tau_R$). Table 10-1 shows possible
results for these runs.

Consider the first of the three runs. The sequence of states is 3 → 2 → 3.
The probability of such a sequence of transitions taking place in the modified
DTMC is $\tilde{p}_{32} \times \tilde{p}_{23} = 1 \times 0.7$; the corresponding probability for the original
DTMC is $p_{32} \times p_{23} = 1 \times 0.87$. The likelihood ratio is therefore
$\frac{1 \times 0.87}{1 \times 0.7}$. (Recall that this is the factor that corrects for our modification of the
transition probabilities to get $\tilde{p}_{ij}$.)

Similarly for the remaining runs.

TABLE 10-1 Three sample paths and the associated likelihood ratios

Run No.   Sample path         Likelihood ratio                                                                                                         $\tau_F < \tau_R$?
1         3, 2, 3             $L_1 = \frac{1 \times 0.87}{1 \times 0.7}$                                                                                No
2         3, 2, 1, 2, 1, 0    $L_2 = \frac{1 \times (3/100) \times (100/102) \times (3/100) \times (2/102)}{1 \times 0.15 \times 0.7 \times 0.15 \times 0.3}$   Yes
3         3, 2, 1, 2, 3       $L_3 = \frac{1 \times (3/100) \times (100/102) \times (87/100)}{1 \times 0.15 \times 0.7 \times 0.7}$                      No

Run 2 of the three simulation runs is the only one that has resulted in the
event $\tau_F < \tau_R$. Therefore, $I_1 = 0$, $I_2 = 1$, $I_3 = 0$, and our simulation estimate of
$\mathrm{Prob}\{\tau_F < \tau_R\}$ is

$$\hat{\theta} = \frac{0 \times L_1 + 1 \times L_2 + 0 \times L_3}{3} = \frac{L_2}{3} \approx 0.0012$$
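A larger experiment on the same chain can be sketched as below. The original transition probabilities out of state 2 (0.87 to state 3, 0.03 to state 1, and hence 0.10 to state 0) and out of state 1 (100/102 and 2/102) are read off the likelihood ratios in Table 10-1 together with the requirement that each row sum to 1; treat them, the dictionary layout, and the seed as assumptions of this sketch.

```python
import random

# Embedded DTMC (original) and balanced-failure-biased DTMC with p* = 0.3.
ORIGINAL = {3: {2: 1.0},
            2: {3: 0.87, 1: 0.03, 0: 0.10},
            1: {2: 100 / 102, 0: 2 / 102}}
MODIFIED = {3: {2: 1.0},
            2: {3: 0.7, 1: 0.15, 0: 0.15},
            1: {2: 0.7, 0: 0.3}}

def likelihood_ratio(path):
    """Original-to-modified probability ratio of a given sequence of states."""
    ratio = 1.0
    for i, j in zip(path, path[1:]):
        ratio *= ORIGINAL[i][j] / MODIFIED[i][j]
    return ratio

def simulate_run(rng):
    """One run of the modified chain from state 3; returns (I_k, L_k)."""
    state, ratio = 3, 1.0
    while True:
        targets = list(MODIFIED[state])
        weights = [MODIFIED[state][j] for j in targets]
        nxt = rng.choices(targets, weights=weights)[0]
        ratio *= ORIGINAL[state][nxt] / MODIFIED[state][nxt]
        state = nxt
        if state == 0:
            return 1, ratio      # system failure before regeneration
        if state == 3:
            return 0, ratio      # back in state N = 3

def estimate_theta(n, seed=4):
    rng = random.Random(seed)
    return sum(i * l for i, l in (simulate_run(rng) for _ in range(n))) / n

print("L_2 for path 3,2,1,2,1,0:", likelihood_ratio([3, 2, 1, 2, 1, 0]))
print("estimate of Prob{tau_F < tau_R}:", estimate_theta(100000))
```

For a chain this small the probability can also be found by solving two linear equations for the chance of reaching state 0 before state 3, which gives 10.26/99, or about 0.104; the simulation estimate should land close to that value.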


Simulating Continuous-Time Markov Chains: Reliability 

To find reliability by simulation, the conventional way is to run the system until it 
enters a system-failure state and then find the total elapsed time to system failure. 
From these times we can obtain the probability distribution function of the time to 
first failure, whose complement is the reliability function. 

Balanced failure biasing can be used for shortening the simulation time for this
case as well. There is, however, an important difference between calculating the
reliability function and estimating the MTBF that we showed in the previous
section. For the latter, we were able to avoid the task of actually storing durations and
just counted the number of times the system failed before getting back to state N.
In our present case, we have to maintain time information in our simulation. Also,
we would like to force at least one transition out of state N.

Doing the latter is quite simple. In a conventional simulation, we would use the
density function $f(t) = \lambda_N e^{-\lambda_N t}$ for simulating the sojourn time of the system in
state N. In the forcing technique, we use instead the density function

$$\tilde{f}(t) = \begin{cases} \dfrac{\lambda_N e^{-\lambda_N t}}{1 - e^{-\lambda_N T}} & \text{if } 0 \le t \le T \\ 0 & \text{otherwise} \end{cases}$$

for some predetermined T. This forces at least one transition out of N prior to
time T.

The likelihood ratio associated with this choice is obviously $f(t)/\tilde{f}(t)$. In practice,
we will combine forcing with balanced failure biasing, in which case the overall
likelihood ratio will be the product of the likelihood ratios of the two.

It is important to note that the forcing technique should only be used if
$1 - e^{-\lambda_N T}$ is a relatively small quantity, and transitions out of state N are rare over the
interval [0, T].
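Sampling from the forced density can be done by inverting its distribution function, $\tilde{F}(t) = (1 - e^{-\lambda_N t})/(1 - e^{-\lambda_N T})$. The sketch below is one way to do this; the function name, the rate, and the horizon are illustrative assumptions. On [0, T] the likelihood ratio $f(t)/\tilde{f}(t)$ reduces to the constant $1 - e^{-\lambda_N T}$.

```python
import math
import random

def forced_sojourn_time(lam, horizon, rng):
    """Draw the sojourn time in state N from the density truncated to [0, horizon].

    Returns (t, likelihood_ratio), where the ratio f(t)/f~(t) is the
    probability 1 - exp(-lam*horizon) of leaving N before the horizon.
    """
    p_leave = -math.expm1(-lam * horizon)       # 1 - exp(-lam * T), computed accurately
    u = rng.random()
    t = -math.log1p(-u * p_leave) / lam         # inverse of the truncated CDF
    return t, p_leave

rng = random.Random(5)
for _ in range(3):
    t, ratio = forced_sojourn_time(lam=1e-3, horizon=10.0, rng=rng)
    print(f"sojourn time {t:.3f}, likelihood ratio {ratio:.5f}")
```

Every sampled time falls inside [0, T], so each run contains at least one transition out of N, and the constant ratio multiplies into whatever likelihood ratio balanced failure biasing contributes for the rest of the path.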

10.4 Random Number Generation 

At the heart of any simulation of probabilistic events is the random number
generator, whose job it is to generate independent and identically distributed (i.i.d.)
random variables according to some specified probability distribution function. The
quality of such a generator is often critical to the accuracy of the results that the
simulation produces, so that choosing a good generator is of considerable practical
importance. We discuss in this section how to create random number generators
and test their quality.

When faced with the need to generate a stream of i.i.d. random numbers
according to some probability distribution function, we usually proceed in two steps. In
the first step, we generate a stream of i.i.d. random numbers that are uniformly
distributed in the range [0,1]; in the second, we transform these to fit the desired
probability distribution.

10.4.1 Uniformly Distributed Random Number Generators

In an ideal world, we would be able to generate truly random numbers that were 
both distributed uniformly over [0,1] and statistically independent of one another. 
If we can identify some physical process that displays the appropriate stochastic 
properties, we could simply take measurements of that process. For example, one 
commercially available generator amplifies the shot noise plus the thermal noise 
in transistors and then uses a thresholding function to convert that noise to bits (if 
the noise is above the threshold, it is a 1, otherwise it is a 0). This stream of bits is 
then processed to produce a sequence of numbers that satisfy quite stringent tests 
of randomness. 

In most instances, however, we have to make do with random numbers
generated by a computer program. Herein lies a fundamental contradiction. Typically,
such a sequence of random numbers $X_1, X_2, \ldots$ satisfies some function $f(\cdot)$ such
that $X_{i+1} = f(X_i)$. Given the seed, $X_0$, we can therefore predict what the numbers
are going to be: there is nothing truly random about them. This is why numbers
generated in such a way are called pseudo-random. We hope when we generate
them that they look sufficiently random that we can get away with using them,
rather than truly random numbers, in our simulations. In effect, we are trying a
variation of the well-known Turing test for intelligence on the random number
sequences. The Turing test for artificial intelligence is as follows. Have people
interact with either a computer or a human being, without being told which. If they
cannot make out from the responses they get to questions whether they are talking
to a computer or a human, then the computer has intelligence. The variation that
applies to random number sequences states that if we generate a pseudo-random
sequence and give it to a statistician without telling her how it was obtained, she
should not be able to distinguish between such a sequence and one generated
truly randomly. This is an extremely stringent test, and one that most generators
will fail. All that is realistic to hope for is that the pseudo-random numbers
generated will be sufficiently close to the real thing to make our simulations sufficiently
accurate for our purposes. A major source of error in simulations is using
poor-quality random number generators. Later in this section, we will see how to test
such sequences to determine if they satisfy statistical properties of randomness.

A commonly used set of URNGs generate linear congruential sequences of the
form:

$$X_{i+1} = (aX_i + c) \bmod m, \qquad 0 \le a, c < m$$

where a, c, m are constants; m is called the modulus of the generator, a the
multiplier, and c the increment. If c = 0, this is known as a multiplicative generator. We
start this iterative process by specifying $X_0$, the seed of this sequence. The
properties of the generator depend on the values of these constants. Given such a
sequence of integers (which must clearly be in the set {0, 1, ..., m - 1}), we define the
sequence of fractions $U_i = X_i/m$, which are supposed to be uniformly distributed
and mutually independent in the range [0,1].

Because the sequence $X_1, X_2, \ldots$ must consist of numbers from a finite set, the
sequence will repeat with time. That is, given any such generator, there always
exists some M such that $X_i = X_{i+M}$ for all sufficiently large i. The smallest such M
is called the period, P, of the generator, and clearly $P \le m$.


■ EXAMPLE 

Consider the generator $X_{n+1} = (aX_n + c) \bmod 8$. (We use an unrealistically
small modulus just for illustration: in practice, as we will see, very large moduli
are used.) We will show that the values of a and c are critical to the functioning
of the generator.

Start by considering the following set of results:

seed   0  1  2  3  4  5  6  7
X_1    1  4  7  2  5  0  3  6
X_2    4  5  6  7  0  1  2  3
X_3    5  0  3  6  1  4  7  2
X_4    0  1  2  3  4  5  6  7
X_5    1  4  7  2  5  0  3  6
X_6    4  5  6  7  0  1  2  3
X_7    5  0  3  6  1  4  7  2

a = 3; c = 1; m = 8

Note that for this sequence, every value of the seed results in a sequence of 
numbers with period 4. Let us now try another set of constants. 


seed    0   1   2   3   4   5   6   7
X_1     2   4   6   0   2   4   6   0
X_2     6   2   6   2   6   2   6   2
X_3     6   6   6   6   6   6   6   6
X_4     6   6   6   6   6   6   6   6
X_5     6   6   6   6   6   6   6   6
X_6     6   6   6   6   6   6   6   6
X_7     6   6   6   6   6   6   6   6

a = 2; c = 2; m = 8

The result is quite disastrous: a very non-random and correlated stream of
numbers. This generator gets trapped into producing a stream of 6s, irrespective
of the seed. Let us try yet another set of constants.


seed    0   1   2   3   4   5   6   7
X_1     1   6   3   0   5   2   7   4
X_2     6   7   0   1   2   3   4   5
X_3     7   4   1   6   3   0   5   2
X_4     4   5   6   7   0   1   2   3
X_5     5   2   7   4   1   6   3   0
X_6     2   3   4   5   6   7   0   1
X_7     3   0   5   2   7   4   1   6

a = 5; c = 1; m = 8


For these values of $a$ and $c$, we have, for every seed, a sequence of numbers
with the maximum period of 8. We should caution that it does not automatically
follow that this is a good generator, just that it passes a basic sanity check.
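
The three tables in this example can be regenerated with a few lines of code. The following Python sketch is only an illustration; the function names `lcg` and `period` are our own choices, not part of the text. It iterates the recurrence and measures the length of the cycle that each seed eventually falls into.

```python
def lcg(a, c, m, seed):
    """Yield X_1, X_2, ... for X_{i+1} = (a*X_i + c) mod m, with X_0 = seed."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x


def period(a, c, m, seed):
    """Length of the cycle that the sequence started at `seed` eventually enters."""
    first_seen = {}              # value -> step at which it first appeared
    x, step = seed, 0
    while x not in first_seen:
        first_seen[x] = step
        x = (a * x + c) % m
        step += 1
    return step - first_seen[x]


if __name__ == "__main__":
    for a, c in [(3, 1), (2, 2), (5, 1)]:
        print(f"a={a}, c={c}:", [period(a, c, 8, s) for s in range(8)])
```

For the three parameter sets used in the example, the cycle lengths are 4, 1 (the stream of 6s), and 8, respectively, for every seed.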


The Linear Congruential Generator (LCG) has a period of $m$ if and only if each
of the following properties holds:

■ $c$ and $m$ are relatively prime (their highest common factor is 1).

■ For every prime number $p$ that divides $m$, $a - 1$ is a multiple of $p$.

■ If $m$ is a multiple of 4, then $a - 1$ is also a multiple of 4.

The proof of this result is outside the scope of this book; see the Further Reading
section for where to find it.
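
The three conditions are easy to check mechanically. The sketch below is a direct transcription of them into Python; `prime_factors` and `has_full_period` are illustrative names of our own, and the trial-division factorization is only meant for small to moderate moduli.

```python
from math import gcd


def prime_factors(n):
    """Return the set of prime factors of n (simple trial division)."""
    factors, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors


def has_full_period(a, c, m):
    """True if the LCG X_{i+1} = (a*X_i + c) mod m satisfies all three conditions."""
    if gcd(c, m) != 1:                                   # c and m relatively prime
        return False
    if any((a - 1) % p != 0 for p in prime_factors(m)):  # a-1 multiple of each prime p | m
        return False
    if m % 4 == 0 and (a - 1) % 4 != 0:                  # extra condition when 4 | m
        return False
    return True
```

Applied to the example above, the check accepts only $a = 5$, $c = 1$, $m = 8$. Note that a multiplicative generator ($c = 0$) can never pass the first condition, so its period is always less than $m$.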

Since random number generators are so important in simulation, many researchers
have carried out extensive searches in the parameter space to find generators
with good properties. One widely used generator with fairly good statistical
properties uses the parameters $a = 16807$, $m = 2^{31} - 1$, $c = 0$.

The periods of LCGs are limited by $m$, and that can be a problem for running
very long simulations. In simulating fault-tolerant systems, in which a very large
number of events must be generated for each system failure that is encountered,
the periods of such generators are often much too small. For example, in the
generator mentioned above, $m = 2^{31} - 1 = 2{,}147{,}483{,}647$, and it is
entirely possible to have in a simulation more than two billion calls to a random
number generator. Because we want the number of calls to be much less than the
generator period, we can use combined generators. One way of doing this is to
select parameters $a_{ij}$, $m_1$, $m_2$, $k$ and define


$$X_{1,n} = (a_{11}X_{1,n-1} + a_{12}X_{1,n-2} + \cdots + a_{1k}X_{1,n-k}) \bmod m_1$$

$$X_{2,n} = (a_{21}X_{2,n-1} + a_{22}X_{2,n-2} + \cdots + a_{2k}X_{2,n-k}) \bmod m_2$$

Now, if these parameters are carefully chosen, the sequence

$$U_n = \left(\frac{X_{1,n}}{m_1} - \frac{X_{2,n}}{m_2}\right) \bmod 1$$

(by mod 1 we mean the fractional part of this expression) will have properties close
to those of i.i.d. uniformly distributed random variables.

After a long computer search for suitable parameters for such a generator, the
following has been recommended as having good statistical properties: $k = 3$,
$m_1 = 2^{32} - 209$, $(a_{11}, a_{12}, a_{13}) = (0, 1403580, -810728)$,
$m_2 = 2^{32} - 22853$, $(a_{21}, a_{22}, a_{23}) = (527612, 0, -1370589)$.
Such a generator has main cycles of length approximately $2^{191}$. See the
Further Reading section for details.
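
A minimal Python sketch of this combined generator follows. It uses the parameter values quoted above and the formula for $U_n$ as stated; the function name and the default seed values are arbitrary assumptions made for illustration (any seeds that are not all zero within a component should do), and no attempt is made at the arithmetic optimizations a production implementation would use.

```python
M1 = 2**32 - 209
M2 = 2**32 - 22853
A1 = (0, 1403580, -810728)       # (a_11, a_12, a_13)
A2 = (527612, 0, -1370589)       # (a_21, a_22, a_23)


def combined_generator(seed1=(12345, 12345, 12345), seed2=(12345, 12345, 12345)):
    """Yield U_1, U_2, ...; each seed holds (X_{n-1}, X_{n-2}, X_{n-3}).

    The default seeds are illustrative assumptions, not values from the text.
    """
    s1, s2 = list(seed1), list(seed2)
    while True:
        # Python's % returns a nonnegative result for a positive modulus,
        # so the negative coefficients need no special handling.
        x1 = sum(a * x for a, x in zip(A1, s1)) % M1
        x2 = sum(a * x for a, x in zip(A2, s2)) % M2
        s1 = [x1, s1[0], s1[1]]
        s2 = [x2, s2[0], s2[1]]
        yield (x1 / M1 - x2 / M2) % 1.0      # fractional part
```

Because Python integers have arbitrary precision, the products never overflow; in a language with fixed-width integers the products would have to be computed more carefully.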


10.4.2 Testing Uniform Random Number 
Generators 

All tests for URNGs ask the following question: How faithfully does the output
of the URNG follow the properties of a uniformly distributed stream of random
numbers that are statistically independent of one another? To answer this
question, we must first identify some of the key properties of interest.

The most obvious property is that of uniformity. That is, we would like to
calculate the extent to which the output is uniformly distributed over the range
[0,1]. Suppose we generate 1000 numbers and find that all of them are in the range
[0,0.7]. Now, it is not impossible that a set of 1000 numbers selected independently
and uniformly from the unit interval should all fall in the range [0,0.7]: the
probability of this event is $0.7^{1000} = 1.25 \times 10^{-155}$, which, although
very small, is certainly not zero. Thus, if we get such a sequence from the URNG
that we are testing, we cannot say for sure that the URNG is bad: all we can say is
that it is very unlikely that a good generator will produce a sequence like that,
and consequently we declare the generator bad.

We present next some ways of testing the goodness of a URNG. As with any
statistical test, there is an interplay between sensitivity and specificity in the
following tests. Looking for too high a sensitivity can result in a high chance of
declaring a generator bad when it is actually good.

The $\chi^2$ Test

Use the URNG to generate a large sequence of numbers. Define
$a_0, a_1, a_2, \ldots, a_{k-1}, a_k$ for some suitable $k$ such that

$$0 = a_0 < a_1 < a_2 < \cdots < a_{k-1} < a_k = 1$$

and define intervals $I_i = [a_i, a_{i+1})$ for $i = 0, \ldots, k-1$. Then, let $O_i$
and $E_i$ be the observed and expected frequencies, respectively, of generated
numbers that fall within interval $I_i$, and define the quantity $S$, which measures
the deviation of the observed frequencies from the expected ones, as

$$S = \sum_{i=0}^{k-1} \frac{(O_i - E_i)^2}{E_i}$$


Clearly, a good URNG will result in a small value of $S$. It can be shown (the
Further Reading section has a pointer to where you can find this derivation) that
if the random numbers were the output of a perfect URNG, and we have a large
number of them (with at least five expected to fall within each of the intervals
$I_i$), $S$ approximately follows the $\chi^2$ distribution with $k - 1$ degrees of
freedom.

It is easy to find tables of the $\chi^2$ distribution in books on statistics or on
the Internet. We reject the URNG if $S$ is so large that the probability that a true
URNG would generate such a deviation (or a larger one) is very small (say, less
than 5%).


■ EXAMPLE 

Let us break up the interval [0,1] into 10 equal subintervals, each of length 0.1.
Thus, $I_i = [0.1i, 0.1i + 0.1)$, for $i = 0, 1, \ldots, 9$. Suppose that after
generating 1000 random numbers, we get the results shown in Table 10-2. Let us
pick 0.05 as our significance level; this means that we will reject the URNG if it
results in a sum $S$ such that the probability of an ideal URNG generating such a
sum or larger is less than 0.05. Referring to a $\chi^2$ table with 9 degrees of
freedom we see that, at a significance level of 0.05, we should reject the URNG if
$S > 16.9$. Because in this example we have $S = 331.98$, we reject this generator:
it deviates too much from the expected behavior. There is a very small probability


TABLE 10-2 Illustrating the $\chi^2$ test

i        O_i     E_i     (O_i - E_i)^2     (O_i - E_i)^2 / E_i
0         15     100          7225               72.25
1        100     100             0                0.00
2        200     100         10000              100.00
3         88     100           144                1.44
4        100     100             0                0.00
5        100     100             0                0.00
6         90     100           100                1.00
7         80     100           400                4.00
8         27     100          5329               53.29
9        200     100         10000              100.00
TOTAL                                           331.98

(much smaller than 0.05) that a good URNG will produce a sequence of numbers
like this one. ■
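
The arithmetic of this example can be checked with a short Python sketch. The observed counts are those of Table 10-2 and the critical value 16.9 is the one quoted in the example; the function and variable names are our own.

```python
def chi_square_statistic(observed, expected):
    """S = sum over intervals of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [15, 100, 200, 88, 100, 100, 90, 80, 27, 200]   # O_i from Table 10-2
expected = [sum(observed) / len(observed)] * len(observed)  # E_i = 100 for each interval
CRITICAL_9DF_5PCT = 16.9   # chi-square table value quoted in the example

S = chi_square_statistic(observed, expected)
verdict = "reject" if S > CRITICAL_9DF_5PCT else "do not reject"

if __name__ == "__main__":
    print(f"S = {S:.2f}: {verdict} the generator")
```

The critical value depends on both the number of intervals and the chosen significance level, so it has to be looked up again whenever either changes.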


Serial Test 

To test whether a URNG produces uniformly distributed random numbers is
necessary but certainly not sufficient. To see why, consider the following
generator (this is an extreme, contrived example whose sole purpose is to make a
point). Generate $Y_1, Y_2, \ldots, Y_n$ using any URNG that closely follows the
uniform distribution. Then, generate a sequence $Z_1, Z_2, \ldots, Z_{nk}$ such
that for some $k > 1$,

$$Z_1 = Z_2 = \cdots = Z_k = Y_1$$
$$Z_{k+1} = Z_{k+2} = \cdots = Z_{2k} = Y_2$$
$$\vdots$$
$$Z_{(n-1)k+1} = Z_{(n-1)k+2} = \cdots = Z_{nk} = Y_n$$

If $Y_1, \ldots, Y_n$ follows the uniform distribution sufficiently closely, the
sequence $Z_i$ will pass the $\chi^2$ test. However, the $Z_i$s would certainly not
be acceptable because they are highly correlated. So, we need to test for lack of
correlation as well: in other words, we have to test for the statistical
independence of successive numbers. Such an independence is really a fake: the
$n$th random number is a function of the $(n-1)$st. All that we are really testing
for is whether the sequence of generated numbers looks like an independent
sequence. Similarly, it is entirely possible (though unlikely) that we would
independently generate random numbers that appear correlated. The best we can
realistically do is ask the question: Is the probability sufficiently high that
such a sequence of numbers would be generated by an ideal generator that produced
numbers independently of one another?

To test for correlation between successive numbers, we can use the serial test. In
$k$ dimensions, the test is as follows. Generate a sequence of random numbers and
then group them together into $k$-tuples as follows:

$$G_1 = (X_1, X_2, \ldots, X_k)$$
$$G_2 = (X_{k+1}, X_{k+2}, \ldots, X_{2k})$$
$$G_3 = (X_{2k+1}, X_{2k+2}, \ldots, X_{3k})$$
$$\vdots$$

Then, divide the $k$-dimensional unit cube into $n$ equal subcubes, count the number
of $k$-tuples that fall into each of the subcubes, and check (using the $\chi^2$
test) whether the $k$-tuples are uniformly distributed among the subcubes.
[FIGURE 10.5 Comparing two generators; panel (b): URNG B.]


■ EXAMPLE 

Suppose we are testing for correlation in two dimensions. To do this, we generate
pairs $(X_1, X_2), (X_3, X_4), \ldots$. We then subdivide the two-dimensional unit
cube (the unit square) into, say, 100 squares (call them mini-squares), each of
area 0.01. We count the number $n_i$ of pairs that fall into mini-square $i$, and
use the $\chi^2$ test to check if these pairs are uniformly spread through the unit
square. If correlation exists, some of the mini-squares will have a significantly
higher concentration of pairs than the others (see Figure 10.5). ■
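
The two-dimensional serial test can be sketched as follows. This is an illustration only: the function name, the use of Python's built-in generator as the "good" stream, and the contrived "bad" stream (every number repeated once, in the spirit of the $Z_i$ sequence described earlier with $k = 2$) are all our own assumptions.

```python
import random


def serial_test_2d(numbers, cells_per_side=10):
    """Chi-square statistic for non-overlapping pairs over a grid of mini-squares."""
    pairs = list(zip(numbers[0::2], numbers[1::2]))
    counts = [0] * (cells_per_side ** 2)
    for x, y in pairs:
        col = min(int(x * cells_per_side), cells_per_side - 1)
        row = min(int(y * cells_per_side), cells_per_side - 1)
        counts[row * cells_per_side + col] += 1
    expected = len(pairs) / len(counts)
    return sum((n - expected) ** 2 / expected for n in counts)


rng = random.Random(1)
good = [rng.random() for _ in range(20000)]
# Contrived correlated stream: each number appears twice in a row,
# so every pair lies on the diagonal of the unit square.
bad = [u for u in good[:10000] for _ in range(2)]

if __name__ == "__main__":
    print("S for the good stream:", round(serial_test_2d(good), 1))
    print("S for the bad stream: ", round(serial_test_2d(bad), 1))
```

The resulting statistic is compared against a $\chi^2$ table with (number of mini-squares minus 1) degrees of freedom, here 99. The repeated stream would pass a one-dimensional uniformity test but produces an enormous value of $S$ here, because only the ten diagonal mini-squares are ever occupied.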


Permutation Test 

Given a certain sequence of numbers, divide them into non-overlapping
subsequences, each of a chosen length, $k$. Each of these subsequences can be in one
of $k!$ possible orderings. If the URNG is good, we expect these orderings to be
equally likely to occur, which can be checked using the $\chi^2$ test.


■ EXAMPLE 

Consider the case $k = 3$. Denote a subsequence by $u_1, u_2, u_3$. This subsequence
has $3! = 6$ possible orderings: $u_1 < u_2 < u_3$; $u_1 < u_3 < u_2$;
$u_2 < u_1 < u_3$; $u_2 < u_3 < u_1$; $u_3 < u_1 < u_2$; and $u_3 < u_2 < u_1$. If we
generate a large number of such sequences, we expect a good URNG to generate each
of these six orderings with a frequency of 1/6. If the frequency of at least one
ordering differs significantly from 1/6 (as measured by the $\chi^2$ test), the URNG
will fail this test. ■
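
A minimal Python sketch of the permutation test is given below, under the assumption that ties between numbers are rare enough to ignore. The function name is our own, and the returned statistic is meant to be compared against a $\chi^2$ table with $k! - 1$ degrees of freedom.

```python
import random
from itertools import permutations


def permutation_test(numbers, k=3):
    """Chi-square statistic over the k! orderings of non-overlapping k-blocks."""
    counts = {p: 0 for p in permutations(range(k))}
    usable = len(numbers) - len(numbers) % k       # drop an incomplete last block
    for start in range(0, usable, k):
        block = numbers[start:start + k]
        # Ordering pattern: positions of the block listed from smallest to largest value.
        pattern = tuple(sorted(range(k), key=lambda j: block[j]))
        counts[pattern] += 1
    expected = (usable // k) / len(counts)
    return sum((n - expected) ** 2 / expected for n in counts.values())


rng = random.Random(3)
good = [rng.random() for _ in range(30000)]
bad = sorted(good)      # an increasing stream: only one ordering ever occurs

if __name__ == "__main__":
    print("S for the good stream:", round(permutation_test(good), 2))
    print("S for the sorted stream:", round(permutation_test(bad), 2))
```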
The Spectral Test

This is probably the most powerful test available. The approach followed by the
spectral test is perhaps easiest to understand in two-dimensional space. Let us
try to draw parallel lines in such a way that each point in the scatter plot is on
one of these lines. Then, find the maximum distance between any two adjacent
parallel lines. Let $d_2$ be the maximum of this quantity, taken over all possible
ways in which such parallel lines can be drawn (the subscript refers to the fact
that we are working in two dimensions). We define $\nu_2 = 1/d_2$ as the
two-dimensional accuracy of the URNG. The larger this quantity the better: the
intuition behind this is that for large values of $\nu_2$, the points are spread out
more "randomly" in two-dimensional space.

This approach can be generalized to higher dimensions. In $k$-dimensional space
(where we would plot $(X_i, X_{i+1}, \ldots, X_{i+k-1})$), we can replace the
parallel lines by $(k-1)$-dimensional parallel hyperplanes and repeat the distance
calculation. The quantity $\nu_k = 1/d_k$ (where $d_k$ is defined for $k$ dimensions
as $d_2$ was for two) is the $k$-dimensional accuracy of the URNG.

It has been recommended to study the scatter up to about six dimensions and
require that $\nu_i \ge 2^{30/i}$ for $i = 2, 3, 4, 5, 6$ to accept a generator as
good.

The only issue left is how to compute $\nu_i$. The theory behind this is beyond the
scope of this book; the user can download routines for running the spectral test
from the Internet.

10.4.3 Generating Other Distributions 

Given a URNG, we can easily generate random numbers that follow other
distributions. There are a handful of standard methods for doing this.

Inverse-Transform Technique 

This technique is based on the fact that if a random variable $X$ has a probability
distribution function $F_X(\cdot)$, the random variable $Y = F_X(X)$ is uniformly
distributed over [0,1]. This can be easily proved as follows.

Denote by $F_X^{-1}$ the inverse function of $F_X$, that is, $F_X^{-1}(F_X(y)) = y$.
(If the inverse does not exist because there are multiple such quantities $y$, use
the smallest such $y$.) Then, for $0 \le y \le 1$,

$$\begin{aligned}
\text{Prob}\{Y \le y\} &= \text{Prob}\{F_X(X) \le y\} \\
&= \text{Prob}\{X \le F_X^{-1}(y)\} \quad \text{(because } F_X(\cdot) \text{ is nondecreasing)} \\
&= F_X(F_X^{-1}(y)) \\
&= y
\end{aligned}$$





Therefore, if we generate random numbers $Y_1, Y_2, \ldots$ that are uniformly
distributed over [0,1], we will get random variables distributed according to
$F_X(\cdot)$ by generating $X_i = F_X^{-1}(Y_i)$.


■ EXAMPLE 

Suppose we want to generate instances of $X$, an exponentially distributed random
variable with parameter $\mu$. The probability distribution function of $X$ is

$$F_X(x) = 1 - e^{-\mu x}, \qquad x \ge 0$$

Now, define

$$Y = F_X(X) = 1 - e^{-\mu X}$$

and

$$e^{-\mu X} = 1 - Y$$

hence

$$-\mu X = \ln(1 - Y)$$

and finally

$$X = -(1/\mu)\ln(1 - Y)$$

Thus, to generate exponentially distributed random numbers, first generate
uniformly distributed random numbers $y$ over [0,1] and then output
$x = -(1/\mu)\ln(1 - y)$. The computation can be sped up a little by recognizing
that $-(1/\mu)\ln y$ will also work; see the Exercises for details. ■
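
In code, the inverse-transform recipe for the exponential distribution is a one-liner. The sketch below uses Python's built-in generator as a stand-in for the URNG; the function name and the `rng` parameter are our own conventions.

```python
import math
import random


def exponential(mu, rng=random):
    """Exponential variate with parameter mu: x = -(1/mu) * ln(1 - y)."""
    y = rng.random()                 # uniform over [0, 1), so 1 - y is never zero
    return -math.log(1.0 - y) / mu


if __name__ == "__main__":
    rng = random.Random(7)
    samples = [exponential(2.0, rng) for _ in range(50000)]
    print("sample mean:", sum(samples) / len(samples), "(theory: 1/mu = 0.5)")
```

Using $1 - y$ rather than $y$ inside the logarithm has a small practical benefit here: Python's `random()` can return exactly 0 but never exactly 1, so the argument of the logarithm is always positive.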


Working with discrete random variables is similar, as shown by the following 
example. 


■ EXAMPLE 

We are asked to generate a discrete-valued random variable $V$ with the following
probability mass function:

$$\text{Prob}\{V = v\} = \begin{cases}
0.1 & \text{if } v = 1 \\
0.3 & \text{if } v = 2 \\
0.6 & \text{if } v = 2.25 \\
0 & \text{otherwise}
\end{cases}$$

The only values that $V$ can take are 1, 2, and 2.25. The corresponding probability
distribution function is clearly

$$F(v) = \text{Prob}\{V \le v\} = \begin{cases}
0.0 & \text{if } v < 1 \\
0.1 & \text{if } 1 \le v < 2 \\
0.4 & \text{if } 2 \le v < 2.25 \\
1.0 & \text{if } v \ge 2.25
\end{cases}$$

This distribution function has jumps at $v = 1$, 2, and 2.25 and is flat otherwise.
Now, generate a uniformly distributed random variable, $U$, over the interval
[0,1], and output

$$V = \begin{cases}
1 & \text{if } 0 \le U \le 0.1 \\
2 & \text{if } 0.1 < U \le 0.4 \\
2.25 & \text{if } 0.4 < U \le 1.0
\end{cases}$$

Why is this an example of the inverse transform approach? See Figure 10.6,
which contains the distribution function. Suppose we get $U = 0.70$ from our URNG.
We then find the interval $(F(2), F(2.25))$ into which $U$ falls and output
$V = 2.25$.


FIGURE 10.6 Generating a discrete random variable.
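
The lookup in this example amounts to finding the first jump of the distribution function that reaches $U$. A small Python sketch, using a binary search over the cumulative probabilities (the names `VALUES`, `CDF`, and `discrete_inverse` are illustrative):

```python
import random
from bisect import bisect_left

VALUES = [1, 2, 2.25]        # values that V can take
CDF = [0.1, 0.4, 1.0]        # F(v) at each of those values


def discrete_inverse(u):
    """Return the smallest value v with F(v) >= u, for u in [0, 1]."""
    return VALUES[bisect_left(CDF, u)]


if __name__ == "__main__":
    rng = random.Random(5)
    draws = [discrete_inverse(rng.random()) for _ in range(100000)]
    for v in VALUES:
        print(v, draws.count(v) / len(draws))
```

For $U = 0.70$ the search lands on the last entry and the function returns 2.25, as in the example.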
■ EXAMPLE

Suppose we are asked to generate a nonhomogeneous Poisson process. This is a
generalization of the well-known Poisson process; the only difference is that
the rate of event occurrences is not a constant $\lambda$ but a function of the
time $t$, denoted by $\lambda(t)$. The probability of an occurrence during the
interval $[t, t + dt]$ is given by $\lambda(t)\,dt$. Nonhomogeneous Poisson
processes are useful in modeling components with failure rates that change with
age.

Our task now is to generate times at which events occur in such a process.
We will do so by generating the time of the first event, then the time of the
second event based on the time of the first event, and so on.

To do this with the inverse-transform technique, we first need to compute
the probability distribution function of the time between successive event
occurrences. The probability of no event occurrence in the time interval
$[t_1, t_2]$ is given by $e^{-\int_{t_1}^{t_2} \lambda(\tau)\,d\tau}$, and therefore,
if the $i$th event occurred at time $t_i$, the interval to the next event occurrence
has the following probability distribution function:

$$F(x \mid t_i) = 1 - e^{-\int_{t_i}^{t_i + x} \lambda(\tau)\,d\tau}$$

Suppose, as an example, that $\lambda(t) = at$, which means that the failure rate
increases linearly as a function of time. Then, the distribution function of the
time interval between the $i$th and $(i+1)$st events will be

$$F(x \mid t_i) = 1 - e^{-\int_{t_i}^{x + t_i} a\tau\,d\tau} = 1 - e^{-a[x^2 + 2xt_i]/2}$$

To use the inverse-transform technique, we set

$$u = 1 - e^{-a[x^2 + 2xt_i]/2}$$

and solve for $x$:

$$x = -t_i + \sqrt{t_i^2 - 2\ln(1 - u)/a}$$

This is the length of the interval separating $t_i$ and $t_{i+1}$. Thus, we will
generate event times as follows. Generate $U_1, U_2, \ldots$, uniformly distributed
over [0,1].

1. Set $t_1 = \sqrt{-2\ln(1 - U_1)/a}$

2. Set $t_2 = t_1 - t_1 + \sqrt{t_1^2 - 2\ln(1 - U_2)/a} = \sqrt{t_1^2 - 2\ln(1 - U_2)/a}$

3. Set $t_3 = t_2 - t_2 + \sqrt{t_2^2 - 2\ln(1 - U_3)/a} = \sqrt{t_2^2 - 2\ln(1 - U_3)/a}$

and so on. ■
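
The recurrence in this example translates directly into a loop. The Python sketch below is an illustration for the linear rate $\lambda(t) = at$ only; the function name, the `count` argument, and the use of Python's built-in generator as the URNG are our own assumptions.

```python
import math
import random


def nhpp_linear_rate(a, count, rng=random):
    """Return the first `count` event times of a process with rate lambda(t) = a*t."""
    times, t = [], 0.0           # starting from t = 0 reproduces the formula for t_1
    for _ in range(count):
        u = rng.random()
        t = math.sqrt(t * t - 2.0 * math.log(1.0 - u) / a)   # t_{i+1} from t_i
        times.append(t)
    return times


if __name__ == "__main__":
    events = nhpp_linear_rate(a=2.0, count=10, rng=random.Random(42))
    print([round(t, 3) for t in events])
```

Because $-\ln(1 - u) \ge 0$, each new event time is at least as large as the previous one, and the gaps between events shrink on average as the rate grows with time.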
■ EXAMPLE

Suppose we want to generate positive random variables distributed according
to the Weibull distribution (see Equation 10.2), for which

$$F(x) = 1 - e^{-\lambda x^{\beta}} \qquad (\text{for } x > 0)$$

We now have

$$u = 1 - e^{-\lambda x^{\beta}}$$

and consequently,

$$x = \left[-\ln(1 - u)/\lambda\right]^{1/\beta}$$

■
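
As a sketch, the Weibull inverse transform looks as follows in Python. The parameter names `lam` and `beta` stand for $\lambda$ and $\beta$ in $F(x) = 1 - e^{-\lambda x^{\beta}}$; the function name is our own.

```python
import math
import random


def weibull(lam, beta, rng=random):
    """Weibull variate with F(x) = 1 - exp(-lam * x**beta), by inverse transform."""
    u = rng.random()                                   # uniform over [0, 1)
    return (-math.log(1.0 - u) / lam) ** (1.0 / beta)


if __name__ == "__main__":
    rng = random.Random(9)
    xs = [weibull(1.0, 2.0, rng) for _ in range(50000)]
    print("sample mean:", sum(xs) / len(xs))
```

With $\beta = 1$ the expression reduces to the exponential case treated earlier, which gives a quick sanity check on the formula.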

Rejection Method 

Suppose we are given a random number generator that produces random numbers
according to a probability density function $g(\cdot)$, and would like to generate
random numbers according to a probability density function $f(\cdot)$ such that
$f(x) \le c\,g(x)$ for all $x$ and for some finite constant, $c$. Then, the
rejection method proceeds as follows:

1. Generate a random number, $Y$, according to the probability density function
$g(\cdot)$.

2. Generate $U$, uniformly distributed over [0,1].

3. If $U \le \frac{f(Y)}{c\,g(Y)}$, output $Y$; otherwise go back to step 1 and try
again. The output has the required probability density function, $f(\cdot)$.

The role of the constant, $c$, is to ensure that $f(Y)/(c\,g(Y))$ is never greater
than 1. We would like to select a function $g(\cdot)$ such that $c$ is not very
large; as you are invited to prove in the Exercises, the average number of times we
have to loop through the above procedure to generate one output is $c$.

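We next prove that this method produces the desired results. Let $X$ denote the
output of the procedure. Then

$$\text{Prob}\{X \le x\} = \text{Prob}\left\{Y \le x \,\Big|\, U \le \frac{f(Y)}{c\,g(Y)}\right\}
= \frac{\text{Prob}\left\{Y \le x \text{ and } U \le \frac{f(Y)}{c\,g(Y)}\right\}}{\text{Prob}\left\{U \le \frac{f(Y)}{c\,g(Y)}\right\}}$$

For the numerator,

$$\text{Prob}\left\{Y \le x \text{ and } U \le \frac{f(Y)}{c\,g(Y)}\right\}
= \text{Prob}\left\{U \le \frac{f(Y)}{c\,g(Y)} \,\Big|\, Y \le x\right\} \text{Prob}\{Y \le x\}
= \frac{F(x)}{c}$$

(fill in the missing steps as an exercise), and for the denominator,

$$\text{Prob}\left\{U \le \frac{f(Y)}{c\,g(Y)}\right\} = \frac{1}{c}$$

(showing this is another exercise).

Hence, $\text{Prob}\{X \le x\} = F(x)$, which completes the proof.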


■ EXAMPLE 

Suppose we want to generate random variables $Z$ according to the normal
distribution, with mean 0 and variance 1. The desired density function is

$$h(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \qquad -\infty < z < \infty$$

We need to find a suitable function $g(\cdot)$. A URNG will not do: its density
function goes to 0 beyond a finite interval. However, we know how to generate
an exponentially distributed random variable (with parameter 1): it has density
$g(x) = e^{-x}$ for $x \ge 0$. The only problem is that the normal distribution is
nonzero for both positive and negative $z$, and the exponential is only defined
for $x \ge 0$.

This difficulty can be easily overcome: observe that $h(z)$ is symmetric about
the origin and $h(z) = h(-z)$. Let us generate a random variable $X = |Z|$: it has
twice the density of the normal over the non-negative half of the interval. This
results in the density function

$$f(x) = \frac{2}{\sqrt{2\pi}}\, e^{-x^2/2}, \qquad 0 \le x < \infty$$

Then, we set $Z = X$ with probability 0.5 and $Z = -X$ with probability 0.5.

We start by finding a $c$ such that $f(x) \le c\,g(x)$. To do this requires us to
maximize $f(x)/g(x)$ over $x \ge 0$: simple calculus shows that this happens when
$x = 1$, so we can use

$$c = \frac{f(1)}{g(1)} = \sqrt{\frac{2e}{\pi}}$$

After some algebraic manipulation, we get

$$\frac{f(x)}{c\,g(x)} = e^{-(x-1)^2/2}$$


Therefore, to generate $X$, we carry out the following steps:

1. Generate $Y$, with probability density function $g_Y(y) = e^{-y}$.

2. Generate $U_1$ uniformly distributed over [0,1].

3. If $U_1 \le e^{-(Y-1)^2/2}$, output $X = Y$; otherwise go back to step 1 and try
again.

To generate $Z$ from $X$, we do the following:

1. Generate $U_2$ uniformly distributed over [0,1].

2. If $U_2 \le 0.5$, output $Z = X$; otherwise output $Z = -X$.
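
The two sets of steps above can be sketched in Python as follows. The function names are our own, Python's built-in generator stands in for the URNG, and the exponential variate in step 1 is produced with the inverse-transform formula from the earlier example.

```python
import math
import random


def half_normal(rng=random):
    """Generate X = |Z| by rejection from an exponential(1) proposal."""
    while True:
        y = -math.log(1.0 - rng.random())            # step 1: Y with density e^{-y}
        u1 = rng.random()                            # step 2: U_1 uniform
        if u1 <= math.exp(-(y - 1.0) ** 2 / 2.0):    # step 3: accept with prob f/(c g)
            return y


def standard_normal(rng=random):
    """Attach a random sign to X to obtain Z with mean 0 and variance 1."""
    x = half_normal(rng)
    return x if rng.random() <= 0.5 else -x


if __name__ == "__main__":
    rng = random.Random(2024)
    zs = [standard_normal(rng) for _ in range(20000)]
    mean = sum(zs) / len(zs)
    var = sum((z - mean) ** 2 for z in zs) / len(zs)
    print("mean:", round(mean, 3), "variance:", round(var, 3))
```

Since $c = \sqrt{2e/\pi} \approx 1.32$, the loop in `half_normal` runs about 1.32 times on average per output, which is why the exponential is a reasonable choice of $g(\cdot)$ here.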


Composition Method 

When the random variable to be generated is the sum of other random variables, 
we can generate each of the latter and then add them up. 


■ EXAMPLE 

We want to generate a random variable Z which is defined as Z = V + X + Y, 
where: 

1. V is uniformly distributed over the interval [0,10]. 

2. X is exponentially distributed with parameter $\mu$.

3. Y has the normal distribution, with mean 5 and variance 23. 

We generate V and X using the inverse transform technique, and Y using the 
rejection method. We then add them up and output the result. ■ 
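
A sketch of this example in Python is shown below. The helper names are our own, and one step is an assumption not spelled out in the text: the normal variate with mean 5 and variance 23 is obtained by shifting and scaling a standard normal variate produced by the rejection method.

```python
import math
import random


def uniform_0_10(rng):
    """V: inverse transform of F(v) = v/10 on [0, 10]."""
    return 10.0 * rng.random()


def exponential(mu, rng):
    """X: inverse transform for the exponential distribution."""
    return -math.log(1.0 - rng.random()) / mu


def standard_normal(rng):
    """N(0,1) by rejection from an exponential(1) proposal, then a random sign."""
    while True:
        y = -math.log(1.0 - rng.random())
        if rng.random() <= math.exp(-(y - 1.0) ** 2 / 2.0):
            return y if rng.random() <= 0.5 else -y


def sample_z(mu, rng):
    """Z = V + X + Y, with the three components generated independently."""
    v = uniform_0_10(rng)
    x = exponential(mu, rng)
    y = 5.0 + math.sqrt(23.0) * standard_normal(rng)   # assumption: shift/scale N(0,1)
    return v + x + y


if __name__ == "__main__":
    rng = random.Random(11)
    zs = [sample_z(2.0, rng) for _ in range(40000)]
    print("sample mean:", sum(zs) / len(zs), "(theory: 5 + 1/mu + 5)")
```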


10.5 Fault Injection 

As mentioned previously in this chapter, simulating a system to obtain its
reliability or similar attributes requires the knowledge of parameters such as the
components' failure rates. These can be obtained either through lengthy
observations, or much faster through fault injection experiments. In such
experiments, various faults are injected either into a simulation model of the
target system or a hardware-and-software prototype of the system. The behavior of
the system in the presence of each fault is then observed and classified.
Parameters that can be estimated based on such experiments include the probability
that a fault will cause an error, and the probability that the system will
successfully perform the actions required to recover from that error (the latter
probability is often called the coverage factor; see Chapter 2). These actions
consist of detecting the fault, identifying the system component affected by the
fault, and taking an appropriate recovery action, which may involve system
reconfiguration. Each of these actions takes time that is not a constant but may
change from one fault to another and may also depend on the current workload.
Thus, fault injection experiments, in addition to providing estimates for the
coverage factor, can also be used to estimate the distribution of the individual
delay associated with each of the above actions.

In addition, fault injection experiments can be used to evaluate and validate the 
system dependability. For example, errors in the implementation of fault-tolerance 
mechanisms can be discovered, and system components whose failure is more 
likely to result in a total system crash can be identified. Also, the effect of the 
system's workload on the dependability can be observed. 

10.5.1 Types of Fault Injection Techniques 

Initially, fault injection studies involved injection of physical faults into the
hardware components of the system. This necessitates being able to modify the
current value of almost every circuit node, thus mimicking a fault that may occur
there. With the considerable increase in circuit density in current VLSI
technologies and the associated reduction in device size, this technique is now
limited in its capabilities because only the pins of integrated circuits can be
easily accessed.

Accessibility can be improved by taking advantage of scan chains, which connect
a large number of internal circuit latches in a sequential manner, and are
currently included in many designs of complex integrated circuits. Scan chains are
normally constructed to simplify the debugging and manufacturing test of the
circuit by allowing the user to shift out the current values (for observation
purposes) and shift in new values. By shifting in erroneous bits, the scan chain
can be used to inject faults as well.

Even so, injecting faults into all internal circuit nodes is not practically
feasible due to the very large number of circuit nodes in even a moderately complex
system, which makes exhaustive insertion prohibitive. Instead, a subset of these
insertion points must be carefully selected.

Several alternative schemes have been developed to allow the injection of faults
without having direct access to internal nodes. One such scheme is to subject the
hardware to particle radiation (for example, heavy-ion radiation). Such radiation
can clearly inject faults into otherwise inaccessible locations, but on the other
hand it can only inject transient faults, because the effect of the particle hit
will disappear after a brief delay. This technique has the additional advantage of
closely resembling what might happen in real life. As device feature sizes in
current integrated circuits get smaller, errors due to neutron and alpha particle
hits become more common. Such particle hits (also called soft errors or single
event upsets) are abundant in space but also appear at ground level due to cosmic
rays that bombard the earth and to radioactive atoms that exist in trace amounts in
the packaging materials.

A different method for fault injection is through power supply disturbances.
The supply voltage is briefly dropped to levels below the nominal range. Unlike
the radiation method, which usually generates single event upsets, this scheme
affects many nodes in the circuit simultaneously, producing multiple transient
faults. Moreover, the exact location of these faults cannot be controlled. The
effect of power supply disturbances does, however, resemble a real-life situation
that may be experienced by computer systems in industrial applications.

Another approach to fault injection is through electromagnetic interference. The 
system is subjected to electromagnetic bursts, which can be either allowed to affect 
all components or be restricted to individual ones. Here too, the injected faults are 
transient. 

The above-mentioned physical injection techniques rely on having a working
prototype of the target system. If the designers wish to test some fault-tolerance
features in their design and modify them if the observed dependability is
insufficient, then the use of a physical injection technique may prove to be too
costly. An alternative would be to inject faults through the software layer. This
technique, known as Software Implemented Fault Injection (SWIFI), can be applied
either to a prototype of the target system or to a simulation model of it. SWIFI
also overcomes some of the problems with physical fault injection such as
repeatability and controllability. It provides easy access to many internal circuit
nodes in the system (but not to all of them) and allows the control of the
location, time, duration, and type of the injected faults much more easily than
does physical injection. An important advantage of the SWIFI approach is that it is
not restricted to hardware faults but allows the injection of software faults as
well.

If SWIFI is applied to a simulation model of the target system rather than a
prototype, then mixed-mode simulation techniques can be used, supporting several
levels of system abstraction including architectural, functional, logical, and
electrical. In mixed-mode simulation, the system is decomposed in a hierarchical
manner, allowing us to simulate various components at different levels of
abstraction. Thus, an injected fault can be simulated at a low abstraction level
and the propagation of its effect through the system can be simulated at higher
abstraction levels, greatly reducing the simulation time. Although simulation-based
fault injection has several desirable properties, injecting faults into a
hardware-software prototype provides more realistic, credible, and accurate
results.

Software fault injections can be performed either during compilation or during
run time. To inject faults during compile time, the program instructions are
modified and errors are injected into the source code or assembly code to emulate
the effect of hardware (permanent or transient) and software faults. To inject
faults during run time, one can use either timers (hardware or software) to
determine the exact instant of the injection, or a software trap that will allow
determining the exact time of the injection relative to some system event. This
technique requires only minor modifications, if any, to the application program.
A third method of timing the fault injection through software is by adding
instructions to the application program. This will allow faults to be injected at
predetermined time instances during the execution of the program.

TABLE 10-3 Comparing the properties of four approaches to fault injection

Property          Hardware     Hardware     Software      Software
                  direct       indirect     during        during
                  injection    injection    compilation   runtime
Accessibility     low          high         low           low to medium
Controllability   high         low          high          high
Intrusiveness     none         none         low           high
Repeatability     high         low          high          high
Cost              high         high         low           low
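
To make the idea of software fault injection and coverage estimation concrete, here is a deliberately simplified Python sketch. Everything in it is a hypothetical illustration rather than a description of any real tool: the "system" is a single 16-bit data word protected by one parity bit, the fault model is one or two randomly placed bit flips, and the fraction of double flips (`p_double`) is an arbitrary assumption.

```python
import random


def parity(word):
    """Even/odd parity of the bits of `word` (the hypothetical detection mechanism)."""
    return bin(word).count("1") % 2


def inject_bit_flips(word, bits_to_flip, rng, width=16):
    """Return `word` with `bits_to_flip` distinct, randomly chosen bits inverted."""
    for pos in rng.sample(range(width), bits_to_flip):
        word ^= 1 << pos
    return word


def estimate_coverage(trials, rng, p_double=0.1, width=16):
    """Fraction of injected faults that the parity check detects."""
    detected = 0
    for _ in range(trials):
        word = rng.getrandbits(width)
        stored_parity = parity(word)                   # computed before the fault
        flips = 2 if rng.random() < p_double else 1    # assumed fault model
        faulty = inject_bit_flips(word, flips, rng, width)
        if parity(faulty) != stored_parity:            # error detected
            detected += 1
    return detected / trials


if __name__ == "__main__":
    print("estimated coverage:", estimate_coverage(20000, random.Random(0)))
```

Under this toy fault model, single flips are always detected and double flips never are, so the estimate should settle near $1 - $ `p_double`. A real experiment would also have to record recovery actions and their delays, as discussed at the start of this section.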

10.5.2 Fault Injection Application and Tools 

Fault injection has been applied for measuring the coverage and latency
parameters, for studying error propagation, and for analyzing the relationship
between the workload of the system and its fault handling capabilities. Another
interesting application of fault injection schemes has been to evaluate the effect
of transient faults on the availability of highly dependable systems. These systems
were capable of recovering from the transient faults but still wasted time doing
so, thus reducing the availability.

Various fault injectors have been developed and are currently in use. Some are
mentioned in the Further Reading section. Studies comparing several fault
injectors have been conducted, concluding that two fault injectors may either
validate each other or complement each other. The latter happens if they cover
different faults.

The different approaches to fault injection result in quite different properties of 
the corresponding tools. Some of these differences are summarized in Table 10-3. 

All fault injection schemes require a well-defined fault model, which should 
represent as closely as possible the faults that one expects to see during the lifetime 
of the target system. A fault model must specify the types of faults, their location 
and duration, and, possibly, the statistical distributions of these characteristics. 
The fault models used in currently available fault injection tools vary considerably, 
from very detailed device level faults (for example, a delay fault on a particular 
wire) to simplified functional level faults (such as an erroneous adder output). 

10.6 Further Reading 

Two textbooks on simulation [12,29] provide useful information on how to write
simulation programs. Another, more elementary and limited, source is the
operations research book [19]. A large number of topics related to simulation
models can be found in [4]. Many simulations are written in special-purpose
simulation languages such as GPSS; for a good source for this language, see [30].
In our treatment, we did not discuss parallel simulation: this is a very promising
approach to speeding up simulation. For details, see [14].

The topic of parameter estimation is covered in many books. See for example, 
[10,31]. [33] provides a readable section on the subject. 

Perhaps the best sources for variance reduction methods are the two
above-mentioned books [13,29]. For importance sampling, see [15,18,26,27]. These
also contain a useful bibliography. [25] provides an early source for the technique
of forcing. A case study of the use of importance sampling in evaluating real-time
system dependability is presented in [11].

An excellent source of information about uniform random number generators
is [23]. It contains a detailed mathematical treatment of the properties of
the linear congruential generator, including the relationships that must hold in
order to have P = m. Especially valuable is its detailed treatment of statistical tests
of randomness. The theoretical basis for the $\chi^2$ test is explained in
detail, and the most powerful test of all, the spectral test, is covered extensively.
This book also has an outstanding set of references to the literature. Additional
sources of information on empirical statistical tests are [5] and [21].
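
To make the condition P = m concrete, the following sketch checks the classical full-period conditions (the Hull-Dobell theorem, which is treated in [23]) for a mixed linear congruential generator $X_{n+1} = (aX_n + c) \bmod m$, and also measures the period by brute force. The parameter names `a`, `c`, and `m` and the function names are assumptions made for this example; the brute-force check is practical only for small moduli.

```python
from math import gcd


def prime_factors(n):
    """Return the set of prime factors of n (trial division)."""
    factors = set()
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.add(d)
            n //= d
        d += 1
    if n > 1:
        factors.add(n)
    return factors


def has_full_period(a, c, m):
    """Check the three conditions under which the period equals m."""
    if gcd(c, m) != 1:                        # c and m must be relatively prime
        return False
    if any((a - 1) % p != 0 for p in prime_factors(m)):
        return False                          # a - 1 divisible by each prime factor of m
    if m % 4 == 0 and (a - 1) % 4 != 0:
        return False                          # a - 1 divisible by 4 whenever m is
    return True


def measured_period(a, c, m, seed=0):
    """Measure the cycle length reached from `seed` by direct iteration."""
    first_seen = {}
    x, step = seed, 0
    while x not in first_seen:
        first_seen[x] = step
        x = (a * x + c) % m
        step += 1
    return step - first_seen[x]
```

For example, `has_full_period(5, 3, 16)` holds and direct iteration visits all 16 values, whereas with a = 3 the third condition fails and the measured period is shorter.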

The recent work in [24] is useful for good random number generators with extremely
long periods.

Generating random numbers with distributions other than uniform is discussed 
in many books. For example, see [5,29]. 

Several survey papers reviewing the uses of fault injectors and the various
available tools have been published [9,20]. Some of the fault injection tools that
have been developed rely on hardware fault injection, e.g., Messaline [2], FIST
[16], Xception [8], and GOOFI [1]. Others are based on software fault injection, for
example, Ferrari [22], FIAT [6], NFTAPE [32], and DOCTOR [17]. A good comparison
of several tools for evaluating the dependability of a fault-tolerant system was
presented in [3]. Another use of software fault injection is to assess the risk involved
in using a software product [34,35]. This scheme uses code that modifies
the program state by injecting anomalies in the instructions to see how badly the
software can behave.
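
The flavor of a software fault injection experiment can be conveyed by a small sketch. It does not reproduce any of the tools cited above. As a simplification, it corrupts data in the program state (a single bit flip) rather than instructions, and it then classifies each run as detected, silently corrupted, or unaffected. The target routine `largest_reading`, its range check, and the 16-bit word size are all hypothetical choices made for illustration.

```python
import random


def largest_reading(readings):
    """Target routine with a simple range check as its error detection."""
    for r in readings:
        if not 0 <= r < 256:            # readings are assumed to be 8-bit values
            raise ValueError("range check failed")
    return max(readings)


def run_campaign(readings, n_experiments, rng, word_bits=16):
    """Inject one random bit flip per experiment and tally the outcomes."""
    golden = largest_reading(readings)  # fault-free reference result
    outcomes = {"detected": 0, "silent_corruption": 0, "no_effect": 0}
    for _ in range(n_experiments):
        state = list(readings)          # fresh copy of the program state
        idx = rng.randrange(len(state))
        bit = rng.randrange(word_bits)
        state[idx] ^= 1 << bit          # inject a single bit flip
        try:
            result = largest_reading(state)
        except ValueError:
            outcomes["detected"] += 1   # the range check caught the fault
            continue
        if result == golden:
            outcomes["no_effect"] += 1  # the fault was masked
        else:
            outcomes["silent_corruption"] += 1
    return outcomes
```

Calling `run_campaign([10, 200, 37, 5], 2000, random.Random(1))` returns the three counts; the fraction of injected faults that are either detected or masked gives a crude estimate of how well the range check covers this particular fault model.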


10.7 Exercises 

1. You are given a set of 10 processors that are believed to follow a Poisson failure
process, with failure rate $\lambda$ per hour per processor. You run the processors for a
week, and obtain the following numbers of failures for each processor: 2, 4, 2,
1, 1, 2, 3, 2, 0, 2.

a. What is your estimate for the value of $\lambda$?

b. Construct a 95% confidence interval for $\lambda$ using Equation 10.1.

c. Construct a 95% confidence interval for $\lambda$ using the fact that for the Poisson
distribution $E(x) = \mathrm{Var}(x) = \lambda$.

d. Explain the difference between the results of parts b and c.

2. You are given a set of 10 processors that are believed to follow a Poisson failure
process, with failure rate $\lambda$ per hour per processor. The prior density of $\lambda$ is a
uniform distribution over the range [0.001, 0.002].

a. You run these processors for 100 hours without any of the processors failing.
What is the best estimate for the value of $\lambda$ (the mean of the posterior
density of $\lambda$)?

b. You continue the experiment for a total of 10,000 hours without observing
any failures. What is your best estimate for $\lambda$?

c. Suppose you were to run this experiment for a very long time without any
processor failing. What do you think the posterior density function for $\lambda$
would be?

3. This question follows up on our comments on the difficulty of validating the 
reliability of a life-critical system to a sufficiently high level of confidence. 

Suppose you were calculating the confidence interval for the reliability of a
life-critical system whose true failure probability over a given interval of operation
is $10^{-8}$. (Of course, you don't know that this failure probability is $10^{-8}$,
which is why you are gathering statistics.) Obtain an estimate of the number of
observations you would require to show with 99.999999% confidence that the
true failure probability is in the range $[0.9 \times 10^{-8}, 1.1 \times 10^{-8}]$.

You will need, for this question, an algorithm to calculate the values of the 
normal distribution with sufficient accuracy. It should not be difficult to find 
one through an Internet search; see, for example, [7]. 

4. Evaluate RANDU, which was a routine widely used many years ago to generate
uniform random numbers. Its recurrence is $X_{n+1} = (65539 X_n) \bmod 2^{31}$. Pick
$X_0 = 23$ and use each of the testing methods described in Section 10.4.2. Software
for the spectral test can be found on the Internet.
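
As a possible starting point for Problem 4, the RANDU recurrence can be coded in a few lines. The function names below are our own, and the statistical tests themselves are deliberately left out.

```python
def randu_ints(seed, count):
    """Return `count` successive values of X_{n+1} = (65539 * X_n) mod 2**31."""
    modulus = 2 ** 31
    x = seed
    values = []
    for _ in range(count):
        x = (65539 * x) % modulus
        values.append(x)
    return values


def randu_uniform(seed, count):
    """Scale the integer sequence to the interval [0, 1)."""
    return [x / 2 ** 31 for x in randu_ints(seed, count)]
```

Calling `randu_uniform(23, 10000)` produces a sample to which the tests of Section 10.4.2 can be applied; successive triples of the sequence are a natural input for a three-dimensional plot.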

5. Repeat Problem 4 for the random number generator that is included in your 
favorite computer system or spreadsheet. 

6. Given a uniform random number generator, obtain a generator for continuous-valued
random variables with the following probability density functions. Assume
that the densities are 0 outside the specified ranges, and that $\mu_1$, $\mu_2$ have
known values.

a. $f_1(x) = 0.25$, $16 \le x < 20$

b. $f_2(x) = 0.4\mu_1 e^{-\mu_1 x} + 0.6\mu_2 e^{-\mu_2 x}$, $x \ge 0$

c. $f_3(x) = \frac{1}{24} x^4 e^{-x}$, $x > 0$

d. $f_4(x) = \begin{cases} x & \text{if } 0 < x < 1 \\ 2 - x & \text{if } 1 \le x < 2 \\ 0 & \text{otherwise} \end{cases}$


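7. Generate discrete random variables with the following probability mass functions
(assume that the parameters have known values):

a. $\mathrm{Prob}\{X = n\} = p(1 - p)^{n-1}$, $n = 1, 2, 3, \ldots$; $0 < p < 1$

b. $\mathrm{Prob}\{X = n\} = e^{-\lambda} \lambda^n / n!$, $n = 0, 1, 2, \ldots$; $\lambda > 0$

c. $\mathrm{Prob}\{X = n\} = \begin{cases} 0.25 & \text{if } n = 1 \\ 0.50 & \text{if } n = 2 \\ 0.25 & \text{if } n = 3 \\ 0 & \text{otherwise} \end{cases}$

d. $\mathrm{Prob}\{X = n\} = 0.7 e^{-\lambda} \lambda^n / n! + 0.3 e^{-2\lambda} (2\lambda)^n / n!$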


8. When deriving the generator for exponentially distributed random variables,
we showed that $-(1/\lambda)\ln(1 - U)$ would work. However, we pointed out
that $-(1/\lambda)\ln U$ would also yield exponentially distributed random variables.
Prove that this is the case.
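
Before attempting the proof, it can be reassuring to check the claim numerically. The sketch below compares the sample means produced by the two expressions; it is only a sanity check, not a proof, and the function names and the choice of rate are assumptions made for illustration.

```python
import math
import random


def exp_via_one_minus_u(rate, rng):
    """Exponential variate from -(1/rate) * ln(1 - U)."""
    return -math.log(1.0 - rng.random()) / rate


def exp_via_u(rate, rng):
    """Exponential variate from -(1/rate) * ln(U)."""
    u = rng.random()
    while u == 0.0:      # random() may return 0.0, for which ln is undefined
        u = rng.random()
    return -math.log(u) / rate


def sample_mean(generator, rate, n, seed):
    """Average n variates drawn from `generator` with a fixed seed."""
    rng = random.Random(seed)
    return sum(generator(rate, rng) for _ in range(n)) / n
```

With a rate of 2, both `sample_mean(exp_via_one_minus_u, 2.0, 200000, 1)` and `sample_mean(exp_via_u, 2.0, 200000, 1)` should come out close to the theoretical mean of 0.5; comparing histograms of the two samples is a stronger check.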

9. When proving the correctness of the rejection method, we omitted some steps. 
Complete the proof with these steps in place. 

10. Write a simulation program to obtain the MTTF of the system shown in
Figure 10.4a.

11. Write a simulation program to find the MTTDL of a RAID Level 3 system,
consisting of eight data disks and one parity disk. The disks fail independently,
according to a Poisson process with rate $10^{-7}$ per hour. The repair time (in
hours) has an exponential density with mean 2 hours.

a. Estimate the mean time to data loss, MTTDL. 

b. Derive the 99% confidence interval for the MTTDL after running a total of 
1000 simulation runs. 

c. Determine how many runs are required to make the width of the 99% confidence
interval less than 10% of the estimated MTTDL (from part a).

d. Vary the number of simulations from 1000 to 10,000, and plot the width of
the confidence interval over this range.

12. Repeat the above simulation, using the method of antithetic variables. Compare
the width of the 99% confidence interval you obtain with the two approaches,
for an identical total number of simulations ranging from 1000 to
10,000.

13. Repeat the above simulation, using the method of importance sampling. Use 
the balanced failure biasing technique. Vary the value of p* from 0.1 to 0.9, in 
steps of 0.1, and run 1000 simulations for each such value. Plot the width of the 
99% confidence interval as a function of p*. 

14. Consider the example discussed in Section 10.3.3. Suppose you carry out a
few runs to get a rough estimate of $\pi_1$ and $\pi_2$, and end up with $\pi_1 = 0.9$ and
$\pi_2 = 0.98$. Your simulation time budget allows you to carry out a total of 1000
simulation runs, so that $n_1 + n_2 = 1000$. What values should you select for $n_1$
and $n_2$ to minimize the variance of your estimate of the survival probability, $\pi$?

15. Consider the system shown in Figure 10.7. Each block suffers failure independently
of the others, according to a Poisson process with rates $\lambda_A = 0.001$,
$\lambda_B = 0.002$, $\lambda_C = 0.005$, $\lambda_D = 0.01$, $\lambda_E = 0.009$, $\lambda_T = 0.005$, and $\lambda_P = 0.00001$ per
time unit. The subscripts refer to the block labels. The blocks marked 3 are perfectly
reliable and never fail. The nodes In and Out represent the input and
output points, and not blocks: they do not fail.

Each node takes an exponentially distributed amount of time to repair: the 
mean time to repair is 1 time unit for all nodes. 

Failure happens when there is no longer a path from the In node to the Out 
node. 

a. Write a simulation program to obtain the mean time to failure for this system.
Plot the width of the 99% confidence interval associated with simulation
runs ranging from 500 to 10,000.

b. Use the method of control variables and repeat part (a).

c. Repeat part (a) by using importance sampling with balanced failure biasing
(p* = 0.2).

FIGURE 10.7 Non-series parallel system.

16. Repeat part (a) of the previous problem, with the blocks now suffering failure
according to a nonhomogeneous Poisson process, in which the failure rates
are increasing functions of time. Use $\lambda_i(t) =$ for $i \in \{A, B, C, D, E, P, T\}$.

Assume that upon node repair, the effective age of that node becomes 0, i.e.,
that upon repair we reset its $t$ to 0.


References 

[1] J. L. Aidemark, J. P. Vinter, P. Folkesson, and J. Karlsson, "GOOFI: A Generic Fault Injection Tool," 
Dependable Systems and Networks Conference (DSN-2001), pp. 83-88, 2001. 

[2] J. Arlat, A. Costes, Y. Crouzet, J. C. Laprie, and D. Powell, "Fault Injection and Dependability
Evaluation of Fault-Tolerant Systems," IEEE Transactions on Computers, Vol. 42, pp. 913-923,
August 1993.

[3] J. Arlat, Y. Crouzet, J. Karlsson, P. Folkesson, E. Fuchs, and G. H. Leber, "Comparison of Physical 
and Software-Implemented Fault Injection Techniques," IEEE Transactions on Computers, Vol. 52, 
pp. 1115-1133, September 2003. 

[4] J. Banks (Ed.), Handbook of Simulation, Wiley, 1998. 

[5] J. Banks, J. S. Carson II, B. L. Nelson, and D. M. Nicol, Discrete-Event System Simulation, Prentice 
Hall, 2001. 

[6] J. H. Barton, E. W. Czeck, Z. Segall, and D. P. Siewiorek, "Fault Injection Experiments Using FIAT," 
IEEE Transactions on Computers, Vol. 39, pp. 575-582, April 1990. 

[7] B. D. Bunday, S. M. H. Bokhari, and K. H. Khan, "A New Algorithm for the Normal Distribution
Function," Sociedad de Estadistica e Investigacion Operativa Test, Vol. 6, pp. 369-377, 1997.

[8] J. Carreira, H. Madeira, and J. G. Silva, "Xception: A Technique for the Experimental Evaluation
of Dependability in Modern Computers," IEEE Transactions on Software Engineering, Vol. 24,
pp. 125-136, February 1998.

[9] J. A. Clark and D. K. Pradhan, "Fault Injection: A Method for Validating Computer-System 
Dependability," IEEE Computer, Vol. 28, pp. 47-56, June 1995. 

[10] A. C. Cohen and B. J. Whitten, Parameter Estimation in Reliability and Life Span Models, Marcel 
Dekker, 1988. 

[11] G. Durairaj, I. Koren, and C. M. Krishna, "Importance Sampling to Evaluate Real-Time System 
Reliability," Simulation, Vol. 76, pp. 172-183, March 2001. 

[12] G. S. Fishman, Discrete Event Simulation, Springer-Verlag, 2001. 

[13] G. S. Fishman, A First Course in Monte Carlo, Duxbury, 2006. 

[14] R. M. Fujimoto, Parallel and Distributed Simulation, Wiley, 2000. 

[15] A. Goyal, P. Shahabuddin, P. Heidelberger, V. F. Nicola, and P. W. Glynn, "A Unified Framework
for Simulating Markovian Models of Highly Dependable Systems," IEEE Transactions on Computers,
Vol. 41, pp. 36-51, January 1992.





[16] U. Gunneflo, J. Karlsson, and J. Torin, "Evaluation of Error Detection Schemes Using Fault
Injection by Heavy-ion Radiation," 19th IEEE International Symposium on Fault-Tolerant Computing
(FTCS-19), pp. 340-347, June 1989.

[17] S. Han, K. G. Shin, and H. A. Rosenberg, "DOCTOR: An Integrated Software Fault Injection
Environment for Distributed Real-time Systems," International Computer Performance and Dependability
Symposium (IPDS'95), pp. 204-213, April 1995.

[18] P. Heidelberger, "Fast Simulation of Rare Events in Queuing and Reliability Models," ACM
Transactions on Modeling and Computer Simulation, Vol. 5, pp. 43-55, January 1995.

[19] F. S. Hillier and G. J. Lieberman, Introduction to Operations Research, McGraw-Hill, 2001. 

[20] M. C. Hsueh, T. K. Tsai, and R. K. Iyer, "Fault Injection Techniques and Tools," IEEE Computer, Vol. 
30, pp. 75-82, April 1997. 

[21] R. Jain, The Art of Computer Systems Performance Analysis, Wiley, 1991. 

[22] G. A. Kanawati, N. A. Kanawati, and J. A. Abraham, "FERRARI: A Flexible Software-Based Fault 
and Error Injection System," IEEE Transactions on Computers, Vol. 44, pp. 248-260, February 1995. 

[23] D. E. Knuth, The Art of Computer Programming, Vol. 2, Addison-Wesley, 1998. 

[24] P. L'Ecuyer, "Random Numbers," International Encyclopedia of Social and Behavioral Sciences,
Elsevier, 2001.

[25] E. E. Lewis and F. Bohm, "Monte Carlo Simulation of Markov Unreliability Models," Nuclear
Engineering and Design, Vol. 77, pp. 49-62, 1984.

[26] M. K. Nakayama, "Fast Simulation Methods for Highly Dependable Systems," Winter Simulation
Conference, pp. 221-228, 1994.

[27] M. K. Nakayama, "A Characterization of the Simple Failure-Biasing Method for Simulations of 
Highly Reliable Markovian Systems," ACM Transactions on Modeling and Computer Simulation, 
Vol. 4, pp. 52-86, January 1994. 

[28] D. Powell, E. Martins, J. Arlat, and Y. Crouzet, "Estimators for Fault Tolerance Coverage
Evaluation," IEEE Transactions on Computers, Vol. 44, pp. 261-274, February 1995.

[29] S. M. Ross, Simulation, Academic Press, 2002. 

[30] T. J. Schriber, An Introduction to Simulation using GPSS/H, Wiley, 1991. 

[31] H. W. Sorenson, Parameter Estimation: Principles and Problems, Marcel Dekker, 1980. 

[32] D. T. Stott, G. Ries, M.-C. Hsueh, and R. K. Iyer, "Dependability Analysis of a High-Speed
Network Using Software-Implemented Fault Injection and Simulated Fault Injection," IEEE
Transactions on Computers, Vol. 47, pp. 108-119, January 1998.

[33] K. S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, 
John Wiley, 2002. 

[34] J. M. Voas and G. McGraw, Software Fault Injection, Wiley Computer Publishing, 1998. 

[35] J. Voas, G. McGraw, L. Kassab, and L. Voas, "Fault-Injection: A Crystal Ball for Software Liability," 
IEEE Computer, Vol. 30, pp. 29-36, June 1997. 



Index 


2-of-5 code, 66, 66t 

A 

ABFT. See Algorithm-based fault tolerance 
Absorbing state, 26 
Acceptance tests, 8 
for duplex systems, 28 
for software, 148-149 
output verification, 148-149 
quality of, 151 
range checks, 149 
timing checks, 148 
Accessibility, 356 
Accomplishment levels, 9 
Active replication, 101 
Adaptive routing, 110 
Adders 

in parity checking, 230 
with separate residue check, 76f 
AddRoundKey, 291, 302-303 
Adjacent bit errors, 68 

Advanced Encryption Standard (AES), 286, 307 
algorithm for, 291, 292f 
examples, 291-294 
fault injection in, 298 
key schedule of, 292f 
AES. See Advanced Encryption Standard 
AES SBox, 290 
Aging, 11-12 
Algorithm(s), 4, 8, 94,161 
bus-based cache coherence, 218f 
cache coherence, 218 
checkpointing, 210-211, 215, 219f 
for AES, 291, 291f 
for dynamic vote assignment, 94f 


routing, 135,140-142 
RSA, 295-296,303-304,304f, 305f, 307 
Algorithm-based fault tolerance (ABFT), 56, 99-102

ALU. See Arithmetic and Logic Unit 
AN-codes, 75 

Antithetic variables, 328-330 
Application-level checkpointing, 198 
Arithmetic 
binary, 68 

floating-point, 101-102 
Arithmetic and Logic Unit (ALU), 160, 270 
Arithmetic codes, 74-79 
classes of, 75 
nonseparable, 75 
separable, 75 
Assertions, 39,173 

Asymmetric keys. See also Public keys, 285-286 
Atomic, Consistent, Isolated, Durable properties 
(ACID), 234 

Availability (A(t)). See also Point Availability, 111 
defined, 5 

of read/write quorum, 135 
Average Computational Capacity, 6 

B 

Back-to-back testing, 167,168f 
Backups, 98,101 

Balanced failure biasing, 337, 341 
Bandwidth, 141 
as network measure, 112 
of crossbar networks, 120-121 
Bathtub curve, 11-12,12f 
Battery backup, 195 
Bayesian approach, 186, 322-324, 324f 




Bayes's formula, 322 
Benign failures, 3-4 
Berger code, 66-67 
Bipartite graph, 265, 265f 
Bi-residue codes, 79 
BIST. See Built-In Self-Testing 
Block(s), 258 

distributed recovery, 171-173,172f 
nested recovery, 172 
recovery, 148,166 
Block ciphers, 286 
Block redundancy, 269, 269f 
Buffers. See also Translation Lookaside Buffers 
ECC-protected, 241 
overflow, 150-151 

Bugs. See also Debugging, 2, 8,148,151,160, 178, 
185 

causes of, 178 
correlated, 169 

Built-In Self-Testing (BIST), 264, 277 
Burst errors, 72 

Bus-based checkpointing algorithm, 219f 
Bus-based coherence algorithm, 218f, 219f, 224 
Butterfly network, 113-114,113f 
analysis of, 116-119 
extra-stage, 115f 
Bypass multiplexers, 114,119 
Byte-interlaced parity codes, 58 
Byzantine agreement, 46-47 
Byzantine failures, 3, 41-46, 42f, 97, 101
Byzantine Generals, 42-46,48 

C 

Cache, 206-207, 223, 230 
coherence, 217-219 

Cache-Aided Rollback Error Recovery 
(CARER), 206-207, 217 
Canonical/resilient structures, 15 
M-of-N systems, 20-23 
NMR variations, 23-27 
non-series/parallel system as, 17-20 
series/parallel systems as, 16-17 
voters as, 23 

CARER. See Cache-Aided Rollback Error Recovery

Cassini, 238-241, 240f, 247 
CCC. See Cube-Connected Cycles 
Central Limit Theorem, 315, 324, 327 
Centralized routing, 135 
CFG. See Control Flow Graph 
Channels. See also Side-channel information, 207 
I/O, 230, 233 
noisy, 4 


Checkpoint(s), 195 
compression, 205 
coordination, 210f 
counter, 218 
"dead", 205 
execution time for, 222t 
latency, 196,199,199f, 205, 223 
logical, 216 
number, 197, 218 

overhead, 195,199,199f, 204-205,223 
permanent, 211 
placement, 201, 223 
size, 196-197 
tentative, 211 
unmodifiable, 218 
useless, 208f 
Checkpoint level, 197 
application-level checkpointing, 198 
kernel-level checkpointing, 198 
user-level checkpointing, 198 
Checkpointing, 196f 
algorithms, 215, 219f 
application-level, 198 
coordinated algorithm, 210-211 
defined, 195-197 
diskless, 212-213, 223 
frequency, 206 
incremental, 204, 223 
in distributed systems, 207-217, 223 
in mobile computers, 224 
in real-time systems, 220-222, 224 
in shared-memory systems, 217-219 
kernel-level, 198 
sequential, 203, 205 
staggered, 215-217 
user-level, 198 

Checksums. See also Column checksum matrix; Full checksum matrix; Row checksum
matrix; Weighted-Checksum Code, 39, 64-65, 102, 232
double-precision, 64 
Honeywell, 64-65, 65f 
residue, 64 

single-precision, 64-65, 65f 
Chip(s), 8, 257 
defect-tolerant, 249 
floorplans of, 272-276 
redundancy scheme of, 262 
VLSI, 249 

yield projection for, 259-263 
Chip-kill faults, 268 
Chordal network, 130f 






Chords, 130-131 
Ciphers, 285, 307
code, 205
2-of-5, 66, 66t
M-of-N, 65-66
residue, 75-76, 78
unidirectional error-detecting, 65
arithmetic codes, 74-79
Commercial Off-the-Shelf software (COTS), 150
Compound Poisson distributions, 260-261 
Compound Poisson model, 261, 272
Conditional probability, 13-15
Consistent comparison problem, 161-163
Continuous-Time Markov Chain (CTMC), 336f,
simulating, 335-341
simulating reliability, 341
Control variables, 330-331
Controllability, 357 
Correlated failures, modeling, 84-88 
COTS. See Commercial Off-the-Shelf software 
COTS microprocessors, 158 
Coverage factor, 25, 27, 356 
Crashes, 154 

CRC. See Cyclic Redundancy Check 
CRC-16 polynomial, 73 
CRC-32 code, 73 
CRC-CCITT polynomial, 73 
Critical area, 251-253, 276 
Crossbar networks, 119-121,120f 
bandwidth of, 120-121
Cryptographic algorithms, 285, 286
Cryptographic devices, 8-9, 285, 307 
modifying against attack, 300 
security attacks on, 296 
Cryptography, 307 





CTMC. See Continuous-Time Markov Chain
theory of, 67-68
with generator polynomial, 69f, 72f
Cyclic Redundancy Check (CRC), 73, 236
Data. See also Command and Data Subsystem;
Mean Time to Data Loss 
accessibility, 55
errors in, 55
loss, 80 

poisoning, 246 
redundancy, 99 

Data Encryption Standard (DES), 286, 289, 307 
fault injection on, 298t 
Feistel function in, 289f
Decryption, 286, 296
Defect-tolerant memory, 277
Defect-tolerant microprocessors, 270 
Density function, 220, 353-354
DES. See Data Encryption Standard
Design diversity, 39 
Device drivers, 238
Dirac delta function, 221
Direct memory access (DMA), 204 
Directory-based protocol, 219 
Discrete-event simulation, 312
Disk(s), 195
failures, 87
lifetime, 81f 
repair time, 81f, 87 
Disk mirroring, 213, 232 
Diskless checkpointing, 212-213, 223
Distributed recovery blocks, 171-173, 172f
Distributed routing, 135 
Distributed systems
Diversity factor, 158
DMA. See Direct memory access 
Domain errors, 174,177 
Domino effect, 209-210, 209f 
Double-bit errors, 57 
Double-precision checksum, 64 
DSN. See Conference on Dependable Systems
DTMC. See Discrete-time Markov chain
Dual-rail logic, 297 
Dual-redundant system, 239 
Duplex systems, 27-28, 27f, 32 
acceptance tests for, 28 
as code, 57 

forward recovery for, 29 
hardware testing for, 29 
pair-and-spare system for, 29 
triplex-duplex system, 29-30 
Dynabus interface, 232, 233 
Dynamic redundancy, 3, 24-25, 25f
ECCs. See Error-correcting codes
EDCs. See Error-detecting codes
Data Encryption Standard, 286
ror syndromes; Single-bit errors; Unidirectional
error-detecting code, 178
adjacent bit, 68
domain, 174, 177
profile, 166
range, 174, 177
rates, 181 
soft, 236,356 
spread of, 2 
syndromes, 100,101 

Error-correcting codes (ECCs), 55, 245, 267, 277
use of, 301
Markov models, 33-36
propagation of, 175-176, 176f
thrown, 175

Exception handling, 173-174,186 
basics of, 175-177
External exceptions, 175
Extra metal defects, 250
Fail-Fast operation, 229-230
Failure(s). See also Mean Time Between Failures;
Mean Time to Failure; Nonfailure regions;
Probability of failure; Single failure
tolerance; Single node failures, 2, 312,
315
Byzantine, 41-46, 42f, 97
cell, 267 
component, 335 
correlated, 84-88 
disk, 87 

hardware, 11-13 
malicious, 42 
network, 97 
node, 126-127 
point, 194 
processor, 8 
software, 4,160,165 
string, 85, 86f, 87 
switchbox, 117-118 
system, 335,336, 337 
timing, 174,177 
Failure rate 
formula, 12-13
Failure regions, 156, 156f
Fast Fourier Transform, 99
Fault(s), 1, 22
noninterfering, 240
non-overlapping, 22-23
protection, 239
spread of, 2
transient, 2, 4, 240, 357
Fault Detection and Reconfiguration, 24-25
Fault injection, 307 
application/tools, 358-359, 358t 
attacks, 9
security attacks through, 296-297

types of, 356-358
tolerance; Single version fault tolerance;
Software Implemented Hardware Fault
Tolerance, 4-5
hardware, 7-8,11 
in Stratus systems, 237 

Fault-tolerance process-level techniques, 36-37
watchdog processor, 37-39
Floating-point arithmetic, 101-102
Floorplans, 271, 277
Full checksum matrix, 100
density, 220, 353-354
Gamma density function, 255, 317
fault tolerance, 7-8, 11
redundancy, 233
Hot-standby, 244
Hyeti microprocessor, 270, 277
fault-tolerant routing, 136-138
reliability of, 142
calculating reliability of, 126-128
cases for, 127-128 
labeling scheme of, 129 
node failures in, 126 
with spare nodes, 127f
IBM G5, 241-242, 247
IBM Sysplex, 242-242, 243f, 247 
ICs. See Integrated Circuits 
Importance sampling, 333-342, 359 
example of, 334-335 
reasoning for, 333-334 
Incidental diversity, 165 
Inclusion/Exclusion formula, 261
Incremental checkpointing, 204, 223
Inexact reexpression, 157 
Infant mortality, 15 
Information redundancy, 3, 8, 55-56
ECCs of, 245
failure during, 202
Intermittent fault, 2-3, 267
Internal exceptions, 175 
Interrupt, 246 

Interstitial redundancy, 122,123f, 142 
Interval estimation, 315
Lisp, 166-167

Littlewood-Verall model, 179-181
Machine Check Abort (MCA), 245
Main memory, 204, 206, 212, 230, 246
Markov models, 8, 33-36, 35f
Mean Time Between Failures (MTBF), 5, 13-15,
336, 341
defined, 5
dependability, 124
traditional, 5-7
Memory. See also Direct memory access; Ran¬
Missing metal defects, 250, 251f
MixColumns, 291, 302
Markov, 8, 33-36, 35f
One-way hash function, 287

Online maintenance, 230 

Online transaction processing, 229 

Operating system (OS), 147,167,197, 233-234 

OPNET, 312 

Optimal checkpointing, 198-200 
accurate model for, 202-204 
placement, 201-202 

time between checkpoints: first order approx¬
Hamming distance of, 57
odd, 58
Public keys, 286, 295-296, 298-299, 307

R
RAID Level 2, 81-82